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Abstract 

Communication,  i.e.,  moving  data,  between  levels  of  a  memory  hierarchy  or  between  parallel  processors  on 
a  network,  can  greatly  dominate  the  cost  of  computation,  so  algorithms  that  minimize  communication  can  run 
much  faster  (and  use  less  energy)  than  algorithms  that  do  not.  Motivated  by  this,  attainable  communication 
lower  bounds  were  established  in  [12,  13,  4]  for  a  variety  of  algorithms  including  matrix  computations.  The  lower 
bound  approach  used  initially  in  [13]  for  Q(N3)  matrix  multiplication,  and  later  in  [4]  for  many  other  linear 
algebra  algorithms,  depended  on  a  geometric  result  by  Loomis  and  Whitney  [16] :  this  result  bounded  the  volume 
of  a  3D  set  (representing  multiply-adds  done  in  the  inner  loop  of  the  algorithm)  using  the  product  of  the  areas  of 
certain  2D  projections  of  this  set  (representing  the  matrix  entries  available  locally,  i.e.,  without  communication). 
Using  a  recent  generalization  of  Loomis’  and  Whitney’s  result,  we  generalize  this  lower  bound  approach  to  a 
much  larger  class  of  algorithms,  that  may  have  arbitrary  numbers  of  loops  and  arrays  with  arbitrary  dimensions, 
as  long  as  the  index  expressions  are  affine  combinations  of  loop  variables.  In  other  words,  the  algorithm  can  do 
arbitrary  operations  on  any  number  of  variables  like  A(*i,  *2,  *2  —  2*i,  3  —  4*3  +  7*4, . . .).  Moreover,  the  result 
applies  to  recursive  programs,  irregular  iteration  spaces,  sparse  matrices,  and  other  data  structures  as  long  as 
the  computation  can  be  logically  mapped  to  loops  and  indexed  data  structure  accesses.  We  also  discuss  when 
optimal  algorithms  exist  that  attain  the  lower  bounds;  this  leads  to  new  asymptotically  faster  algorithms  for 
several  problems. 


1  Introduction 

Algorithms  have  two  costs:  computation  (e.g.,  arithmetic)  and  communication,  i.e.,  moving  data  between  levels 
of  a  memory  hierarchy  or  between  processors  over  a  network.  Communication  costs  (measured  in  time  or  energy 
per  operation)  already  greatly  exceed  computation  costs,  and  the  gap  is  growing  over  time  following  technological 
trends  [10,  9].  Thus  it  is  important  to  design  algorithms  that  minimize  communication,  and  if  possible  attain 
communication  lower  bounds.  In  this  work,  we  measure  communication  cost  in  terms  of  the  number  of  words  moved 
(a  bandwidth  cost),  and  will  not  discuss  other  factors  like  per-message  latency,  congestion,  or  costs  associated  with 
noncontiguous  data.  Our  goal  here  is  to  establish  new  lower  bounds  on  the  communication  cost  of  a  much  broader 
class  of  algorithms  than  possible  before,  and  when  possible  describe  how  to  attain  these  lower  bounds. 

Communication  lower  bounds  have  been  a  subject  of  research  for  a  long  time.  Hong  and  Kung  [12]  used  an 
approach  called  pebbling  to  establish  lower  bounds  for  @(iV3)  matrix  multiplication  and  other  algorithms.  Irony, 
Tiskin  and  Toledo  [13]  proved  the  result  for  Q(N3)  matrix  multiplication  in  a  different,  geometric  way,  and  extended 
the  results  both  to  the  parallel  case  and  the  case  of  using  redundant  copies  of  the  data.  In  [4]  this  geometric  approach 
was  further  generalized  to  include  any  algorithm  that  “geometrically  resembled”  matrix  multiplication  in  a  sense 
to  be  made  clear  later,  but  included  most  direct  linear  algebra  algorithms  (dense  or  sparse,  sequential  or  parallel), 
and  some  graph  algorithms  as  well.  Of  course  lower  bounds  alone  are  not  algorithms,  so  a  great  deal  of  additional 
work  has  gone  into  developing  algorithms  that  attain  these  lower  bounds,  resulting  in  many  faster  algorithms  as 
well  as  remaining  open  problems  (we  discuss  attainability  in  Section  7). 

Our  geometric  approach,  following  [13,  4],  works  as  follows.  We  have  a  set  Z  of  arithmetic  operations  to  perform, 
and  the  amount  of  data  available  locally,  i.e.,  without  any  communication,  is  M  words.  For  example,  M  could  be 
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the  cache  size.  Suppose  we  can  upper  bound  the  number  of  (useful)  arithmetic  operations  that  we  can  perform 
with  just  this  data;  call  this  bound  F.  Letting  j  •  |  denote  the  cardinality  of  a  set,  if  the  total  number  of  arithmetic 
operations  that  we  need  to  perform  is  \Z\  (e.g.,  \Z\  =  N 3  multiply-adds  in  the  case  of  dense  matrix  multiplication 
on  one  processor),  then  we  need  to  refill  the  cache  at  least  \Z\/F  times  in  order  to  perform  all  the  operations.  Since 
refilling  the  cache  has  a  communication  cost  of  moving  O(M)  words  (e.g.,  writing  at  most  M  words  from  cache 
back  to  slow  memory,  and  reading  at  most  M  new  words  into  cache  from  slow  memory),  the  total  communication 
cost  is  f i(M  ■  \Z\/F)  words  moved.  This  argument  is  formalized  in  Section  4. 

The  most  challenging  part  of  this  argument  is  determining  F.  Our  approach,  described  in  more  detail  in 
Sections  2-3,  builds  on  the  work  in  [13,  4];  in  those  papers,  the  algorithm  is  modeled  geometrically  using  the 
iteration  space  of  the  loop  nest,  as  sketched  in  the  following  example. 

Example:  Matrix  Multiplication  (Part  1/51).  For  N-  by -N  matrix  multiplication,  with  3  nested  loops  over 
indices  (ii,  *21*3)1  any  subset  E  of  the  N3  inner  loop  iterations  C(i\,i2)  +=  A(ii,i3)  ■  B(i3,i2)  can  be  modeled  as 
a  subset  of  an  N-  by-N-by-N  discrete  cube  Z  =  {1, . . .  ,N}3  C  Z3 .  The  data  needed  to  execute  a  subset  of  this 
cube  is  modeled  by  its  projections  onto  faces  of  the  cube,  i.e.,  (*i , *2) ,  (*1 , *3)  and  (*3,  *2)  for  each  (ii,i2, *3)  €E  E. 
F  is  the  maximum  number  of  points  whose  projections  are  of  size  at  most  O(M).  In  [13,  4]  the  authors  used 
a  geometric  theorem  of  Loomis  and  Whitney  (see  Theorem  2.1,  a  special  case  of  [16,  Theorem  2])  to  show  F  is 
maximized  when  E  is  a  cube  with  edge  length  M 1/2,  so  F  =  0(M 3/2),  yielding  the  communication  lower  bound 
fi(M  •  \Z\/F)  =  f 1(M  •  N3/M 3/2)  =  fl(7V3/M1/2).  A 

Our  approach  is  based  on  a  major  generalization  of  [16]  in  [6]  that  lets  us  geometrically  model  a  much  larger  class 
of  algorithms  with  an  arbitrary  number  of  loops  and  array  expressions  involving  affine  functions  of  indices. 

We  outline  our  results: 

Section  2  introduces  the  geometric  model  that  we  use  to  compute  the  bound  F  (introduced  above).  We  first 
describe  the  model  for  matrix  multiplication  (as  used  in  [13,  4])  and  apply  Theorem  2.1  to  obtain  a  bound 
of  the  form  F  =  0(M3/2).  Then  we  describe  how  to  generalize  the  geometric  model  to  any  program  that 
accesses  arrays  inside  loops  with  subscripts  that  are  affine  functions  of  the  loop  indices,  such  as  A(ii,i2), 
B(i2 ,*3),  C(i  1  +  3 i2, 1  —  i\  +  7i2  +  8 i3, . . .),  etc.  (More  generally,  there  do  not  need  to  be  explicit  loops  01- 
array  references;  see  Section  4  for  the  general  case.)  In  this  case,  we  seek  a  bound  of  the  form  F  =  0(MSHBL) 
for  some  constant  shbl- 

Section  3  takes  the  geometric  model  and  in  Theorem  3.2  proves  a  bound  yielding  the  desired  shbl-  HBL  stands  for 
Holder-Brascamp-Lieb,  after  the  authors  of  precedents  and  generalizations  of  [16].  In  particular,  Theorem  3.2 
gives  the  constraints  for  a  linear  program  with  integer  coefficients  whose  solution  is  shbl- 

Section  4  formalizes  the  argument  that  an  upper  bound  on  F  yields  a  lower  bound  on  communication  of  the  form 
Q(M ■  \  Z\/F).  Some  technical  assumptions  are  required  for  this  to  work.  Using  the  same  approach  as  in  [4],  we 
need  to  eliminate  the  possibility  that  an  algorithm  could  do  an  unbounded  amount  of  work  on  a  fixed  amount 
of  data  without  requiring  any  communication.  We  also  describe  how  the  bound  applies  to  communication  in 
a  variety  of  computer  architectures  (sequential,  parallel,  heterogeneous,  etc.). 

Section  5.  Theorem  3.2  proves  the  existence  of  a  certain  set  of  linear  constraints,  but  does  not  provide  an  algorithm 
for  writing  it  down.  Theorem  5.1  shows  we  can  always  compute  a  linear  program  with  the  same  feasible  region; 
i.e.,  the  lower  bounds  discussed  in  this  work  are  decidable.  Interestingly,  it  may  be  undecidable  to  compute 
the  exact  set  of  constraints  given  by  Theorem  3.2:  Theorem  5.9  shows  that  having  such  an  effective  algorithm 
for  an  arbitrary  set  of  array  subscripts  (that  are  affine  functions  of  the  loop  indices)  is  equivalent  to  a  positive 
solution  to  Hilbert’s  Tenth  Problem  over  Q,  i.e.,  deciding  whether  a  system  of  rational  polynomial  equations 
has  a  rational  solution.  Decidability  over  Q  is  a  longstanding  open  question;  this  is  to  be  contrasted  with  the 
problem  over  Z,  proven  undecidable  by  Matiyasevich-Davis-Putnam- Robinson  [17],  or  with  the  problem  over 
R,  which  is  Tarski-decidable  [21]. 

Section  6.  While  Theorem  5.1  demonstrates  that  our  lower  bounds  are  decidable,  the  algorithm  given  is  quite 
inefficient.  In  many  cases  in  many  cases  of  practical  interest  we  can  write  down  an  equivalent  linear  program 
in  fewer  steps;  in  Section  6  we  present  several  such  cases  (Part  2  of  this  paper  will  discuss  others).  For 
example,  when  every  array  is  subscripted  by  a  subset  of  the  loop  indices,  Theorem  6.6  shows  we  can  obtain 
an  equivalent  linear  program  immediately.  For  example,  this  includes  matrix  multiplication,  which  has  three 
loop  indices  i\ ,  i2 ,  and  i3,  and  array  references  A(ii,i3),  B(i3,  i2)  and  C(ii,i2)-  Other  examples  include  tensor 
contractions,  the  direct  N-body  algorithm,  and  database  join. 

1This  indicates  that  this  is  the  first  of  5  times  that  matrix  multiplication  will  be  used  as  an  example. 


2 


Section  7  considers  when  the  communication  lower  bound  for  an  algorithm  is  attainable  by  reordering  the  exe¬ 
cutions  of  the  inner  loop;  we  call  such  an  algorithm  communication  optimal.  For  simplicity,  we  consider  the 
case  where  all  reorderings  are  correct.  Then,  for  the  case  discussed  in  Theorem  6.6,  i.e. ,  when  every  array  is 
subscripted  by  a  subset  of  the  loop  indices,  Theorem  7.1  shows  that  the  dual  linear  program  yields  the  “block 
sizes”  needed  for  a  communication-optimal  sequential  algorithm.  Section  7.3  uses  Theorem  6.6  to  do  the  same 
for  a  parallel  algorithm,  providing  a  generalization  of  the  “2.5D”  matrix  algorithms  in  [20].  Other  examples 
again  include  tensor  contractions,  the  direct  N-body  algorithm,  and  database  join. 

Section  8  summarizes  our  results,  and  outlines  the  contents  of  Part  2  of  this  paper.  Part  2  will  discuss  how  to 
compute  lower  bounds  more  efficiently  and  will  include  more  cases  where  optimal  algorithms  are  possible, 
including  a  discussion  of  loop  dependencies. 

This  completes  the  outline  of  the  paper.  We  note  that  it  is  possible  to  omit  the  detailed  proofs  in  Sections  3  and  5 
on  a  first  reading;  the  rest  of  the  paper  is  self-contained. 

To  conclude  this  introduction,  we  apply  our  theory  to  two  examples  (revisited  later),  and  show  how  to  derive 
communication-optimal  sequential  algorithms. 

Example:  Matrix  Multiplication  (Part  2/5).  The  code  for  computing  C  =  A  ■  B  is 

for  *i  =  l:  iV,  for  *2  =  1:  N,  for  i3  =  1  :  TV, 

C(h,i2)  +=  A(ix,i3)  •  B(i3,i2) 

(We  omit  the  details  of  replacing  ‘+=!  by  ‘=’  when  *3  =  1;  this  will  not  affect  our  asymptotic  analysis.)  We  record 
everything  we  need  to  know  about  this  algorithm  in  the  following  matrix  A,  which  has  one  column  for  each  loop 
index  *1,  i2,  is,  one  row  for  each  array  A,  B ,  C ,  and  ones  and  zeros  to  indicate  which  arrays  have  which  loop  indices 
as  subscripts,  i.e., 

h  ii  h 

All  0  1  \ 

A  =  B  0  1  1  ; 

C  \  1  1  0  / 

for  example,  the  top-left  one  indicates  that  i\  is  a  subscript  of  A.  Suppose  we  solve  the  linear  program  to  maximize 
xi  +  x2  +  Xs  subject  to  A  •  (aq,  x2,  Xs)T  <  (1, 1, 1)T,  getting  aq  =  x2  =  x3  =  1/2.  Then  Theorem  7.1  tells  us  that 
shbl  =  X1+X2  +  X3  =  3/2,  yielding  the  well-known  communication  lower  bound  of  fl(Ar3/MSHBL_1)  =  fl(N3 /M1^2) 
for  any  sequential  implementation  with  a  fast  memory  size  of  M.  Furthermore,  a  communication-optimal  sequential 
implementation,  where  the  only  optimization  permitted  is  reordering  the  execution  of  the  inner  loop  iterations  to 
enable  reduced  communication  (e.g.,  using  Strassen’s  algorithm  instead  is  not  permitted)  is  given  by  blocking  the 
3  loops  using  blocks  of  size  MXl-by-MX2-by-MX3 ,  i.e.,  M1/2-by-M1/2-by-M1/2,  yielding  the  code 

for  ji  =  1  :  MXl  :  N,  for  j2  =  1  :  MX2  :  N,  for  j3  =  1  :  MX3  :  N, 

for  ki  =  0  :  MXl  -  1,  for  k2  =  0  :  MX2  -  1,  for  k3  =  0  :  MX3  -  1, 

=  (j\,h,js)  +  {ki,k2,k3) 

C(ii,i2)  +=  A(ii,i3)  ■  B(i3,  i2) 

yielding  the  well-known  blocked  algorithm  where  the  innermost  three  loops  multiply  M1/2-by-M1/2  blocks  of  A 
and  B  and  update  a  block  of  C.  We  note  that  to  have  all  three  blocks  fit  in  fast  memory  simultaneously,  they 
would  have  to  be  slightly  smaller  than  M1l2-by-M1l2  by  a  constant  factor.  We  will  address  this  constant  factor 
and  others,  which  are  important  in  practice,  in  Part  2  of  this  work  (see  Section  8).  A 

Example:  Complicated  Code  (Part  1/4).  The  code  is 

for  i[  =  1  :  N,  for  i2  =  1  :  N,  for  i3  =  1  :  N,  for  *4  =  1:  N,  for  *5  =  1:  N,  for  *6  =  1  :  N, 

Ai(ii,  *3,  *6)  +=  funci(A2(*i,  *2,  *4),  A3(i2,  *3,  *5),  A4(i3,  *4,  *6)) 

A5(i2,  i&)  +=  func2(A6(*i,  U,  k),  A3(i3,  *4,  *6)) 

This  is  not  meant  to  be  a  practical  algorithm  but  rather  an  example  of  the  generality  of  our  approach.  We  record 
everything  we  need  to  know  about  this  program  into  a  6-by-6  matrix  A  with  one  column  for  every  loop  index 
*  1, . . .  ,*6,  one  row  for  every  distinct  array  expression  A\, . . . ,  A§,  and  ones  and  zeros  indicating  which  loop  index 
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appears  as  a  subscript  of  which  array.  We  again  solve  the  linear  program  to  maximize  x±  +  •  •  ■  +  Xq  subject  to 
A  •  {x\, . . . ,  xq)t  <  (1, . . . ,  1)T,  getting  (x\, . . . ,  Xq)  =  (|,  |,  |).  Then  Theorem  7.1  tells  us  that  shbl  = 

X\-\ —  ■  +  Xq  =  yielding  the  communication  lower  bound  of  S2(1V6/MSHBL_1)  =  tt(N6 /M8'7)  for  any  sequential 
implementation  with  a  fast  memory  size  of  M .  Furthermore,  a  communication-optimal  sequential  implementation 
is  given  by  blocking  loop  index  ij  by  block  size  of  Mxi .  In  other  words,  the  six  innermost  loops  operate  on  a  block 
of  size  M2/7- by—  •  •  -by -M4/7.  A 

Other  examples  appearing  later  include  matrix-vector  multiplication,  tensor  contractions,  the  direct  TV-body  algo¬ 
rithm,  database  join,  and  computing  matrix  powers  Ak . 


2  Geometric  Model 

We  begin  by  reviewing  the  geometric  model  of  matrix  multiplication  introduced  in  [13] ,  describe  how  it  was  extended 
to  more  general  linear  algebra  algorithms  in  [4],  and  finally  show  how  to  generalize  it  to  the  class  of  programs 
considered  in  this  work. 

We  geometrically  model  matrix  multiplication  C  =  A  ■  B  as  sketched  in  Parts  1/5  and  2/5  of  the  matrix 
multiplication  example  (above).  However  the  ‘classical’  (3  nested  loops)  algorithm  is  organized,  we  can  represent  it 
by  a  set  of  integer  lattice  points  indexed  by  I  =  (*i,  A,  A)  £  1? .  If  A,  B ,  C  are  N-by-N ,  these  indices  occupy  an  N- 
by-N-by-N  discrete  cube  Z  =  {1, . . . ,  N}3.  Each  point  1  G  Z  represents  an  operation  C(ii,  i2)+=A(ii ,iz)-B(iz,  i2) 
in  the  inner  loop,  and  each  projection  of  I  onto  a  face  of  the  cube  represents  a  required  operand:  e.g.,  (i-\ ,  A) 
represents  C{i\,i2)  (see  Figure  1). 


Figure  1:  Modeling  3x3  Matrix  Multiplication  as  a  Lattice  (each  lattice  point  represented  by  a  cube) 

We  want  a  bound  F  >  |2£|,  where  E  C  Z  is  any  set  of  lattice  points  I  representing  operations  that  can  be 
performed  just  using  data  available  in  fast  memory  of  size  M.  Let  Ea:  Eb  and  Ec  be  projections  of  E  onto  faces 
of  the  cube  representing  all  the  required  operands  A(i\,iz),  H(A,*2)  and  C(i±,i2),  resp.,  needed  to  perform  the 
operations  represented  by  E.  A  special  case  of  [16,  Theorem  2]  gives  us  the  desired  bound  on  E: 

Theorem  2.1  (Discrete  Loomis- Whitney  Inequality,  3D  Case).  With  the  above  notation,  for  any  finite 
subset  E  of  the  3D  integer  lattice  1? ,  |2£|  <  \Ea\1^2  ■  \ Eb\ 1^2  ■  |£'c'|1^2- 

Finally,  since  the  entries  represented  by  Ea,Eb,Eq  fit  in  fast  memory  by  assumption  (i.e.,  \Ea\,  \Eb\,  \Ec\  <  M), 
this  yields  the  desired  bound  F  >  \E\: 

\E\  <  \EA\l/2  ■  \Eb\1/2  ■  \EC\1/2  <  M1/2  ■  M1/2  ■  M1/2  =  M3/2  =:  F  .  (2.1) 
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Irony  et  al.  [13]  applied  this  approach  to  obtain  a  lower  bound  for  matrix  multiplication.  Ballard  et  al.  [4] 
extended  this  approach  to  programs  of  the  form 

for  all  (*i,i2,  *3)  €  Z  C  Z3,  in  some  order, 

C{h,i2)  =  C(iltia)  +ilti2  A(ii,i3)  *il>i2)i 3  B(i3,i2) 

and  picked  the  iteration  space  Z,  arrays  A,  B,  C,  and  binary  operations  +i1:i2  and  *i1,i2,i3  to  represent  many  (dense 
or  sparse)  linear  algebra  algorithms.  To  make  this  generalization,  Ballard  et  al.  made  several  observations,  which 
also  apply  to  our  case,  below. 

First,  explicitly  nested  loops  are  not  necessary,  as  long  as  one  can  identify  each  execution  of  the  statement (s)  in 
the  inner  loop  with  a  unique  lattice  point;  for  example,  many  recursive  computations  can  also  match  this  model. 
Second,  the  upper  bound  F  when  Z  is  an  N-by-N-by-N  cube  is  valid  for  any  subset  Z  C  Z3.  Thus,  we  can  use 
this  bound  for  sparse  linear  algebra;  sparsity  may  make  the  bound  hard  or  impossible  to  attain,  but  it  is  still  valid. 
Third,  the  memory  locations  represented  by  array  expressions  like  A(ji,i3)  need  not  be  contiguous  in  memory,  as 
long  as  there  is  an  injective2  mapping  from  array  expressions  to  memory  locations.  In  other  words,  one  can  shift 
the  index  expressions  like  A(ii  +  3,  i3  +  7),  or  use  pointers,  or  more  complex  kinds  of  data  structures,  to  determine 
where  the  array  entry  actually  resides  in  memory.  Fourth,  the  arrays  A,  B  and  C  do  not  have  to  be  distinct;  in  the 
case  of  (in-place)  LU  factorization  they  are  in  fact  the  same,  since  L  and  U  overwrite  A.  Rather  than  bounding 
each  array,  say  \Ea\  <  M,  if  we  knew  that  A,  B  and  C  did  not  overlap,  we  could  tighten  the  bound  by  a  constant 
factor  by  using  \Ea\  +  \Eb\  +  \Ec\  <  M.  (For  simplicity,  we  will  generally  not  worry  about  such  constant  factors 
in  this  paper,  and  defer  their  discussion  to  Part  2  of  this  work.) 

Now  we  consider  our  general  case,  which  we  will  refer  to  as  the  geometric  model.  This  model  generalizes  the 
preceding  one  by  allowing  an  arbitrary  number  of  loops  defining  the  index  space  and  an  arbitrary  number  of  arrays 
each  with  arbitrary  dimension.  The  array  subscripts  themselves  can  be  arbitrary  affine  (integer)  combinations  of 
the  loop  indices,  e.g.,  A(i2,  i3  —  2*4,  i\  +  i2  +  3). 

Definition  2.2.  An  instance  of  the  geometric  model  is  an  abstract  representation  of  an  algorithm,  taking  the  form 

for  all  leZC  Zd,  in  some  order, 

(2.2) 

inner  Joop(I,  {A1} ...,  Am),  (</>i, . . . ,  4>m)) 

For  j  £  {1, . . . ,  m},  the  functions  <j>j  :  7Ld  — »•  Z di  are  Z-affine  maps,  and  the  functions  Aj  :  4>j(Zd)  — >  {variables}  are 
injections  into  some  set  of  variables.  The  subroutine  inner _loop  constitutes  a  fairly  arbitrary  section  of  code,  with 
the  constraint  that  it  accesses  each  of  the  variables  Ai((f>i(l)), . . . ,  Am(<pm(I)) ,  either  as  an  input,  output,  or  both, 
in  every  iteration  I  £  Z . 

(We  will  be  more  concrete  about  what  it  means  to  ‘access  a  variable’  when  we  introduce  our  execution  model  in 
Section  4.) 

Assuming  that  all  variables  are  accessed  in  each  iteration  rules  out  certain  optimizations,  or  requires  us  not 
to  count  certain  inner  loop  iterations  in  our  lower  bound.  For  example,  consider  Boolean  matrix  multiplication, 
where  the  inner  loop  body  is  C{i\,i2)  =  C(ii,i2 )  V  (A(ii,i3)  A  B{i3,i2)).  As  long  as  A(ii,i3)  A  B{i3,i2)  is  false, 
C{i\,i2)  does  not  need  to  be  accessed.  So  we  need  either  to  exclude  such  optimizations,  or  not  count  such  loop 
iterations.  And  once  C(i\,i2)  is  true,  its  value  cannot  change,  and  no  more  values  of  A(ii,i3)  and  B(i3,i2)  need 
to  be  accessed;  if  this  optimization  is  performed,  then  these  loop  iterations  would  not  occur  and  not  be  counted  in 
the  lower  bound.  In  general,  if  there  are  data-dependent  branches  in  the  inner  loop  (e.g.,  ‘flip  a  coin. . .  ’),  we  may 
not  know  to  which  subset  Z  of  all  the  loop  iterations  our  model  applies  until  the  code  has  executed.  We  refer  to  a 
later  example  (database  join)  and  [4,  Section  3.2.1]  for  more  examples  and  discussion  of  this  assumption. 

Following  the  approach  for  matrix  multiplication,  we  need  to  bound  \E\  where  FI  is  a  nonempty  finite  subset  of  Z. 
To  execute  E,  we  need  array  entries  Ai((f>i(E)), . . . ,  Arn(<prn(E)) .  Analogous  to  the  analysis  of  matrix  multiplication, 
we  need  to  bound  \E\  in  terms  of  the  number  of  array  entries  needed:  \<f>j(E)\  entries  of  Aj,  for  j  =  {1, . . . ,  m}. 

Suppose  there  were  nonnegative  constants  s3, . . .  ,sm  such  that  for  any  such  E, 

m 

\E\<U\Mm\«  (2-3) 

3= 1 

2 We  assume  injectivity  since  otherwise  an  algorithm  could  potentially  access  only  the  same  few  memory  locations  over  and  over 
again.  Given  our  asymptotic  analysis,  however,  we  could  weaken  this  assumption,  allowing  a  constant  number  of  array  entries  to  occupy 
the  same  memory  address. 
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i.e. ,  a  generalization  of  Theorem  2.1.  Then  just  as  with  matrix  multiplication,  if  the  number  of  entries  of  each  array 
were  bounded  by  \<f>j(E)\  <  M,  we  could  bound 

77i  m 

\E\  <  1 4>j{E)\a=  <  MSj  =  M^r=isi  =:  Ms HBL  =:  p 

3=1  3  =  1 

where  sHbl  :=  YJjLt  sj- 

The  next  section  shows  how  to  determine  the  constants  si,...,sm  such  that  inequality  (2.3)  holds.  To  give 
some  rough  intuition  for  the  result,  consider  again  the  special  case  of  Theorem  2.1,  where  \E\  can  be  interpreted 
as  a  volume,  and  each  \cf>j(E)\  as  an  area.  Thus  one  can  think  of  \E\  as  having  units  of,  say,  meters3,  and  \<fj(E)\ 
as  having  units  of  meters2.  Thus,  for  Jlj=i  \(l)j(E)\Sj  to  have  units  of  meters3,  we  would  expect  the  Sj  to  satisfy 
3  =  2si  +  2s2  +  2s3.  But  if  E  were  a  lower  dimensional  subset,  lying  say  in  a  plane,  then  \E\  would  have  units 
meters2  and  we  would  get  a  different  constraint  on  the  Sj.  This  intuition  is  formalized  in  Theorem  3.2  below. 


3  Upper  bounds  from  HBL  theory 

As  discussed  above,  we  aim  to  establish  an  upper  bound  of  the  form  (2.3)  for  \E\,  for  any  finite  set  E  of  loop 
iterations.  In  this  section  we  formulate  and  prove  Theorem  3.2,  which  states  that  such  a  bound  is  valid  if  and  only 
if  the  exponents  si, . . . ,  sm  satisfy  a  certain  finite  set  of  linear  constraints,  specified  in  (3.1)  below. 


3.1  Statement  and  Preliminaries 


As  in  the  previous  section,  I  =  (*i, . . . ,  id)  denotes  a  point  in  Zd.  For  each  j  £  {1,2,...,  m}  we  are  given  a  Z-affine 
map  (f>j :  Zd  — >  Zdj ,  i.e.,  a  function  that  returns  an  affine  integer  combination  of  the  coordinates  as  a 

dj- tuple,  where  dj  is  the  dimension  (number  of  arguments  in  the  subscript)  of  the  array  Aj.  So,  the  action  of  fj 
can  be  represented  as  multiplication  by  a  dj-  by— d  integer  matrix,  followed  by  a  translation  by  an  integer  vector 
of  length  dj.  As  mentioned  in  Section  2,  there  is  no  loss  of  generality  in  assuming  that  (j)j  is  linear,  rather  than 
affine.  This  is  because  Aj  can  be  any  injection  (into  memory  addresses),  so  we  can  always  compose  the  translation 
(a  bijection)  with  Aj.  In  this  section  we  assume  that  this  has  been  done  already. 

Notation  3.1.  We  establish  some  group  theoretic  notation;  for  an  algebra  reference  see,  e.g.,  [15].  We  regard  the 
set  hr  (for  any  natural  number  r)  as  an  additive  Abelian  group,  and  any  Z-linear  map  <j>:  Z'r  — >  Zr  as  a  group 
homomorphism.  An  Abelian  group  G  has  a  property  called  its  rank,  related  to  the  dimension  of  a  vector  space:  the 
rank  of  G  is  the  cardinality  of  any  maximal  subset  of  G  that  is  linearly  independent.  All  groups  discussed  in  this 
paper  are  Abelian  and  finitely  generated,  i.e.,  generated  overZ  by  finite  subsets.  Such  groups  have  finite  ranks.  For 
example,  the  rank  of  V  is  r.  The  notation  H  <  G  indicates  that  H  is  a  subgroup  of  G,  and  the  notation  H  <  G 
indicates  that  H  is  a  proper  subgroup. 

The  following  theorem  provides  bounds  which  are  fundamental  to  our  application. 

Theorem  3.2.  Let  d  and  dj  be  nonnegative  integers  and  <f>j :  Zd  — >  Zdj  be  group  homomorphisms  for  j  £ 
{1,2,...,  m),  and  consider  some  s  £  [0,  \]m .  Suppose  that 
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rank(ld)  <  Sj  rank {(f>j(H))  for  all  subgroups  H  <  Zd . 

3= 1 


Then 
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iu<n  \(fj(E)\Si  for  all  nonempty  finite  sets  E  C  Zd. 

3  =  1 


Conversely,  if  (3.2)  holds  for  s  £  [0,  l]m,  then  s  satisfies  (3.1). 


(3.1) 


(3.2) 


While  (3.1)  may  suggest  that  there  are  an  infinite  number  of  inequalities  constraining  s  (one  for  each  subgroup 
H  <  Zd),  in  fact  each  rank  is  an  integer  in  the  range  [0,  d],  and  so  there  are  at  most  (d+  l)m+1  different  inequalities. 

Note  that  if  s  £  [0,oo)m  satisfies  (3.1)  or  (3.2),  then  any  t  £  [0,oo)m  where  each  tj  >  Sj  does  too.  This 
is  obvious  for  (3.1);  in  the  case  of  (3.2),  since  E  is  nonempty  and  finite,  then  for  each  j,  1  <  \cj)j(E)\  <  oo,  so 

1 01 W  <  \ME)\^- 

The  following  result  demonstrates  there  is  no  loss  of  generality  restricting  s  £  [0,  l]m,  rather  than  in  [0,oo)m,  in 
Theorem  3.2. 
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Proposition  3.3.  Whenever  (3.1)  or  (3.2)  holds  for  some  s  G  [0,  oo)m,  it  holds  fort  G  [0,  l]m  where  tj  =  min(sj,  1) 
for  j  G  { 1 - •»>}■ 


3.2  Generalizations 


We  formulate  here  two  extensions  of  Theorem  3.2.  Theorem  3.6  extends  Theorem  3.2  both  by  replacing  the  sets 
E  by  functions  and  by  replacing  the  codomains  hdj  by  finitely  generated  Abelian  groups  with  torsion;  this  is  a 
digression  which  is  not  needed  for  our  other  results.  Theorem  3.10  instead  replaces  torsion- free  finitely  generated 
Abelian  groups  by  finite-dimensional  vector  spaces  over  an  arbitrary  field.  In  the  case  this  field  is  the  set  of  rational 
numbers  Q,  it  represents  only  a  modest  extension  of  Theorem  3.2,  but  the  vector  space  framework  is  more  convenient 
for  our  method  of  proof.  We  will  show  in  Section  3.2  how  Theorem  3.10  implies  Theorem  3.2  and  Theorem  3.6. 
The  main  part  of  Section  3  will  be  devoted  to  the  proof  of  Theorem  3.10. 


Notation  3.4.  Let  X  he  any  set.  In  the  discussion  below,  X  will  always  be  either  a  finitely  generated  Abelian 
group,  or  a  finite- dimensional  vector  space  over  a  field  F.  For  p  G  [l,oo),  £P(X)  denotes  the  space  of  all  p-power 
summable  functions  f :  X  — >  C,  equipped  with  the  norm 


ll/llP  = 


£°°(X)  is  the  space  of  all  bounded  functions  f:  X  — »  C,  equipped  with  the  norm  ||/||oo  =  suPa;eJf  \.f(x)\-  These 
definitions  apply  even  when  X  is  uncountable:  if  p  G  [l,oo),  any  function  in  £P(X)  vanishes  at  all  but  countably 
many  points  in  X,  and  the  £°°-norm  is  still  the  supremum  of  \f\,  not  for  instance  an  essential  supremum  defined 
using  Lebesgue  measure. 

We  will  use  the  terminology  HBL  datum  to  refer  to  either  of  two  types  of  structures;  the  first  is  defined  as 
follows: 


Definition  3.5.  An  Abelian  group  HBL  datum  is  a  3-tuple 


(G,  (Gj),  (fa)) 


(G,  (G\,  ...,  Gm),  ((/>!,  ...,  4>m)) 


where  G  and  each  Gj  are  finitely  generated  Abelian  groups,  G  is  torsion- free,  each  <j>j:  G  — ►  Gj  is  a  group  homo¬ 
morphism,  and  the  notation  on  the  left-hand  side  always  implies  that  j  G  {1,  2, . . .  ,m}. 

Theorem  3.6.  Consider  an  Abelian  group  HBL  datum  ( G ,  (Gj),  (<f>j))  and  s  G  [0,  l]m.  Suppose  that 

m 

rank(IL)  <  Sj  rank (<f>j(H))  for  all  subgroups  H  <  G.  (3-3) 

i=i 


Then 

m  m 

Ell/i^iW)  ^  n  ||/,  ||i/s,  for  all  functions  0  <  fj  G  £1/Sj(Gj).  (3.4) 

xGGj-1  j= 1 


In  particular, 

m 

i£i<n  \4>j(E)\1^Si  for  all  nonempty  finite  sets  EGG. 
1=1 


Conversely,  if  (3.5)  holds  for  s  G  [0,  l]m,  then  s  satisfies  (3.8). 


(3.5) 


Remark  3.7.  [6,  Theorem  2-4]  treats  the  more  general  situation  in  which  G  is  not  required  to  be  torsion-free,  and 
establishes  the  conclusion  (3.4)  in  the  weaker  form 


m  m 

^  c'ii  Wfjh/Sj  for  all  functions  0  <  fj  G  £1/sj  (Gj).  (3.6) 

x£Gj= 1  3-1 


where  the  constant  C  <  oo  depends  only  on  G  and  ... ,  </>m}.  The  constant  C  =  1  established  here  is  optimal. 
Indeed,  consider  any  single  x  G  G,  and  for  each  j  define  fj  to  be  1^ .  ,  the  indicator  function  for  <f>j  (x) .  Then  both 

sides  of  the  inequalities  in  (3.4)  are  equal  to  1.  In  Remark  3.11,  we  will  show  that  the  constant  C  is  bounded  above 
by  the  number  of  torsion  elements  of  G. 
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Proof  of  Theorem  3.2.  Theorem  3.2  follows  from  Theorem  3.6  by  setting  G  =  7Ld  and  Gj  =  Zdj  for  each  j.  □ 

Proof  of  Theorem  3.6  (necessity) .  Necessity  of  (3.3)  follows  from  an  argument  given  in  [6,  Theorem  2.4]. 

Consider  any  subgroup  H  <  G.  Let  r  =  rank(ff).  By  definition  of  rank,  there  exists  a  set  {ej}J_1  of  elements  of 
H  such  that  for  any  coefficients  to*,  rij  €  Z,  121=1  miei  —  121=1  n*e*  only  if  m.;  =  n*  for  all  i.  For  any  positive  integer 
N  define  Em  to  be  the  set  of  all  elements  of  the  form  Y2i=i  nieii  where  each  n,  ranges  freely  over  {1,2, . . .  ,N}. 
Then  \EN\  =  Nr. 

On  the  other  hand,  for  j  £  {1, . . . ,  to}, 

\<pj(EN)\  <  AjNTaD (3.7) 

where  Aj  is  a  finite  constant  which  depends  on  (f>j ,  on  the  structure  of  Hj1  and  on  the  choice  of  {e,},  but  not  on 
N.  Indeed,  it  follows  from  the  definition  of  rank  that  for  each  j  it  is  possible  to  permute  the  indices  i  so  that  for 
each  i  >  rank (<f>j(H))  there  exist  integers  ki  and  Kij  such  that 

rank(0j  (i/)) 

kifj  {C-i)  —  1  '  Ki,l<t>j(ei)- 

1= 1 


The  upper  bound  (3.7)  follows  from  these  relations. 
Inequality  (3.2)  therefore  asserts  that 


^yrank(ff)  <  ^.s.  rank (<pj(H))  _  rank 

3=1 

where  A  <  oo  is  independent  of  N.  By  letting  N  tend  to  infinity,  we  conclude  that  rank (H)  <  12JL i  si  ia,nk(<j>j  (H)) , 
as  was  to  be  shown.  □ 

We  show  sufficiency  in  Theorem  3.6  in  its  full  generality  is  a  consequence  of  the  special  case  in  which  all  of  the 
groups  Gj  are  torsion-free. 

Reduction  of  Theorem  3.6  (sufficiency)  to  the  torsion-free  case.  Let  us  suppose  that  the  Abelian  group  HBL  datum 
(G,  (Gj),  ((j>j))  and  s  satisfy  (3.3).  According  to  the  structure  theorem  of  finitely  generated  Abelian  groups,  each  Gj 
is  isomorphic  to  Gj  ®  Tj  where  Tj  is  a  finite  group  and  Gj  is  torsion-free.  Here  ®  denotes  the  direct  sum  of  Abelian 
groups;  G’  ©  G"  is  the  Cartesian  product  G'  x  G"  equipped  with  the  natural  componentwise  group  law.  Define 
7 Tj  :  Gj  ©  Tj  — >  Gj  to  be  the  natural  projection;  thus  TTj  (x,t)  =  x  for  (x,t)  £  Gj  x  Tj .  Define  <pj  =  7 Tj  o<f>jm.  G  — >  Gj . 

If  K  is  a  subgroup  of  Gj,  then  rank(7r,  (A"))  =  rank(A")  since  the  kernel  Tj  of  i Tj  is  a  finite  group.  Therefore  for 
any  subgroup  H  <  G,  rank (cj>j(H))  =  rank (cj>j(H)).  Therefore  whenever  (G,  (Gj),  ((j>j))  and  s  satisfy  the  hypotheses 
of  Theorem  3.6,  (G,  (Gj),  (4>j))  and  s  also  satisfy  those  same  hypotheses. 

Under  these  hypotheses,  consider  any  ?n-tuple  /  =  (fj)  from  (3.4),  and  for  each  j,  define  fj  by 

fj(y)  =  ma xfj(y,t),  for  y  €  Gj. 


For  any  x  G  G,  f3((fj(x))  <  fj(f>(x)).  Consequently  fljli  fMoix))  <  TIjli  fj(4>j(x))- 

We  are  assuming  validity  of  Theorem  3.6  in  the  torsion-free  case.  Its  conclusion  asserts  that 


m  m 

xeGj—l  j- 1 


For  each  j,  ||/jj|i/s  <  ||/j||i/s..  This  is  obvious  when  Sj  =  0.  If  Sj  0  then 


ll/ill 


l/Sj 

1/Sj 


E  hiy)1/Si  -  E  ma xfjfat)1''*  <  E  E 

y£Gj  y^Gj  y^Gj  t^zTj 


II  fi 


1/Sj 
1/Sj  ■ 


Combining  these  inequalities  gives  the  conclusion  of  Theorem  3.6. 
Our  second  generalization  of  Theorem  3.2  is  as  follows. 


□ 


Notation  3.8.  dim(H)  will  denote  the  dimension  of  a  vector  space  V  over  a  field  F;  all  our  vector  spaces  are 
finite- dimensional.  Similar  to  our  notation  for  subgroups,  W  <  V  indicates  that  W  is  a  subspace  ofV,  and  W  <  V 
indicates  that  W  is  a  proper  subspace. 

Definition  3.9.  A  vector  space  HBL  datum  is  a  3-tuple 


(V,  ( V )),  (&)) 


{V,{V1,...,Vm),{cf1,...,cfm)) 


where  V  and  Vj  are  finite- dimensional  vector  spaces  over  a  field  F,  <f>j:  V  — »•  Vj  is  an  F -linear  map,  and  the  notation 
on  the  left-hand  side  always  implies  that  j  €  {1,2,...,  m}. 


Theorem  3.10.  Consider  a  vector  space  HBL  datum  (V,  (Vj),  {(fj))  and  s  €  [0,  l]m.  Suppose  that 


m 

dim(W)  <  E  Sj  dim(</>j(VF))  for  all  subspaces  W  <  V. 
j= i 


Then 

m  m 

e  n  MM*))  ^  n  WfjWi/sj  for  oil  functions  0  <  fj  G  I1/sj  (Vj). 

xev  j—i  j= l 


In  particular, 

m 

\e\  <  n  \me)\i/s> 

3= 1 

Conversely,  if  (3.10)  holds  for  s  e  [0,  l]m,  and  if¥ 


for  all  nonempty  finite  sets  EC  V. 

=  Q  or  if  F  is  finite,  then  s  satisfies  (3.8). 


(3.8) 


(3.9) 


(3.10) 


The  conclusions  (3.4)  and  (3.9)  remain  valid  for  functions  fj  without  the  requirement  that  the  Ix!si  norms  are 
finite,  under  the  convention  that  the  product  0  •  oo  is  interpreted  as  zero  whenever  it  arises.  For  then  if  any  fj  has 
infinite  norm,  then  either  the  right-hand  side  is  oo  while  the  left-hand  side  is  finite,  or  both  sides  are  zero;  in  either 
case,  the  inequality  holds. 


Proof  of  Theorem  3.6  (sufficiency)  in  the  torsion-free  case.  Let  us  suppose  that  the  Abelian  group  HBL  datum 
(G,  (Gj),  (cfj))  and  s  satisfy  the  hypothesis  (3.3)  of  Theorem  3.6;  furthermore  suppose  that  each  group  Gj  is 
torsion-free.  Gj  is  isomorphic  to  Zdj  where  dj  =  rank  (Gj).  Likewise  G  is  isomorphic  to  Zd  where  d  =  rank(G). 
Thus  we  may  identify  (G,  (Gj),  ((fj))  with  (Zd,  (Zdj),  ((fj)),  where  each  homomorphism  (fj  :  lA  — >  Zdj  is  obtained 
from  (fj  by  composition  with  these  isomorphisms.  Defining  scalar  multiplication  in  the  natural  way  (i.e. ,  treating 
Zd  and  and  Zdj  as  Z-modules) ,  we  represent  each  Z-linear  map  (fj  by  a  matrix  with  integer  entries. 

Let  F  =  Q.  Regard  Zd  and  Zdj  as  subsets  of  Qd  and  of  Qdj ,  respectively.  Consider  the  vector  space  HBL  datum 
(V,  (V}),  (ifj))  defined  as  follows.  Let  V  =  Qd  and  Vj  =  Qdj ,  and  let  ifj  :  Qd  — >  Qdj  be  a  Q-linear  map  represented 
by  the  same  integer  matrix  as  (fj. 

Observe  that  (V,  (Vj),  (ifj))  and  s  satisfy  the  hypothesis  (3.8)  of  Theorem  3.10.  Given  any  subspace  W  < 
V,  there  exists  a  basis  S  for  W  over  Q  which  consists  of  elements  of  Zd.  Define  H  <  Zd  to  be  the  subgroup 
generated  by  S  (over  Z).  Then  rank(H)  =  dim(VF).  Moreover,  tfj(W)  equals  the  span  over  Q  of  (fj(H ),  and 
dim(^>j(VF))  =  rank (cfj(H)).  The  hypothesis  rank(iJ)  <  YfijL i  sj  rank(<fj(H))  is  therefore  equivalently  written  as 
dim(VF)  <  YfJj=\  sj  dim(^j(VF)),  which  is  (3.8)  for  W. 

Consider  the  m-tuple  /  =  (fj)  corresponding  to  any  inequality  in  (3.4)  for  (Zd,  (Zdj),  ((fj))  and  s,  and  for  each 
j  define  Fj  by  Fj(y)  =  fj(y)  if  y  €  Zdi  C  QdJ,  and  Fj(y)  =  0  otherwise.  By  Theorem  3.10, 


m  m  mm 

e  iu-to)  =  e  <  n  = n  wfih/sr 

xezd  j— i  j=i  j= i  j— 1 


Thus  the  conclusion  (3.4)  of  Theorem  3.6  is  satisfied.  □ 

Conversely,  it  is  possible  to  derive  Theorem  3.10  for  the  case  F  =  Q  from  the  special  case  of  Theorem  3.6  in 
which  G  =  Zd  and  Gj  =  Zdj  by  similar  reasoning  involving  multiplication  by  large  integers  to  clear  denominators. 
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Remark  3.11.  Suppose  we  relax  our  requirement  of  an  Abelian  group  HBL  datum  (Definition  3.5)  that  the  finitely 
generated  Abelian  group  G  is  torsion-free;  let  us  call  this  an  HBL  datum  with  torsion.  The  torsion  subgroup  T(G) 
of  G  is  the  (finite)  set  of  all  elements  x  £  G  for  which  there  exists  0  /  n  £  Z  such  that  nx  =  0.  As  mentioned 
in  Remark  3.7,  it  was  shown  in  [6,  Theorem  2-4]  that  for  an  HBL  datum  with  torsion,  the  rank  conditions  (3.3) 
are  necessary  and  sufficient  for  the  existence  of  some  finite  constant  C  such  that  (3.6)  holds.  A  consequence  of 
Theorem  3.6  is  a  concrete  upper  bound  for  the  constant  C  in  these  inequalities. 

Theorem  3.12.  Consider  {G,{Gj),{(j>j)),  an  HBL  datum  with  torsion,  and  s  £  [0,  l]m.  If  (3.3)  holds,  then  (3.6) 
holds  with  C  =  |T(G)|.  In  particular, 


|2?|  <  |T(G)|  •  \<pj(E)\Sj  for  all  nonempty  finite  sets  E  C  G.  (3-11) 

j=i 

Conversely,  if  (3.11)  holds  for  s  €  [0,  l]m,  then  s  satisfies  (3.3). 

Proof.  To  prove  (3.6),  express  G  as  G  ®  T(G)  where  G  <  G  is  torsion-free.  Thus  arbitrary  elements  x  €  G 
are  expressed  as  x  =  {x,t)  with  x  €  G  and  t  £  T(G).  Then  (G,  (Gj),  (<j)j  |^))  is  an  Abelian  group  HBL  datum 
(with  G  torsion-free)  to  which  Theorem  3.6  can  be  applied.  Consider  any  t  £  T(G).  Define  gj :  Gj  — >  [0,  oo)  by 
9 j  (xj )  =  fj(xj  +<pj(0,t)).  Then  fj(<Pj(y,t))  =  fj{<j>j{y,  0)  +  4>j(0,  t))  =  9j(<t>j(y,  0)),  so 

m  mm  m 

n  n  fjiMv’t)) = Yi  n  gj{<j>j(y,0))  ~  n  ^(<5)  ii  i/sj  -  n  \\fj\\i/<>r 

yeGi=1  yeG0=l  i=1  i=1 

The  first  inequality  is  an  application  of  Theorem  3.6.  Summation  with  respect  to  f  £  T(G)  gives  the  required 
bound. 

To  show  necessity,  we  consider  just  the  inequalities  (3.11)  corresponding  to  the  subsets  E  C  G  and  follow  the 
proof  of  the  converse  of  Theorem  3.6,  except  substituting  A|T(G)|  for  A.  □ 

The  factor  |T(G)|  cannot  be  improved  if  the  groups  Gj  are  torsion  free,  or  more  generally  ifT(G)  is  contained 
in  the  intersection  of  the  kernels  of  all  the  homomorphisms  (j>j;  this  is  seen  by  considering  E  =  T(G). 

3.3  The  polytope  V 

The  constraints  (3.3)  and  (3.8)  are  encoded  by  convex  polytopes  in  [0,  l]m. 

Definition  3.13.  For  any  Abelian  group  HBL  datum  (G,  {Gj),  (<t>j)),  we  denote  the  set  of  all  s  £  [0,  l]m  which  satisfy 
(3.3)  by  V{G,  {Gj),  (<f>j))-  For  any  vector  space  HBL  datum  {V,  (Vj),  (4>j)),  we  denote  the  set  of  all  s  £  [0,  l]m  which 
satisfy  (3.8)  by  V (V,  (Vj),  (fa)) . 

V  =  V{V,  {Vj),  (<pj))  is  the  subset  of  Rm  defined  by  the  inequalities  0  <  Sj  <  1  for  all  j,  and  r  <  Li  sjrj 
where  (r,  rq, . . . ,  rm)  is  any  element  of  {1,  2, . . . ,  dim(V)}  x  h  •  •  •  ,dim(0j(V))}  for  which  there  exists  a 

subspace  W  <  V  which  satisfies  dim(H/)  =  r  and  dim^^tT))  =  fj  for  all  j.  Although  infinitely  many  candidate 
subspaces  W  must  potentially  be  considered  in  any  calculation  of  V,  there  are  fewer  than  (dim(H)  +  l)m+1  tuples 
(r,  r i, . .  • ,  rm)  which  generate  the  inequalities  defining  V .  Thus  V  is  a  convex  polytope  with  finitely  many  extreme 
points,  and  is  equal  to  the  convex  hull  of  the  set  of  all  of  these  extreme  points.  This  discussion  applies  equally  to 

V{G.{Gj),{0j)). 

In  the  course  of  proving  Theorem  3.6  above,  we  established  the  following  result. 

Lemma  3.14.  Let  {G,  {Gj),  (<pj))  be  Abelian  group  HBL  datum,  let  {Zd,{Z dj),{4>j))  be  the  associated  datum  with 
torsion-free  codomains,  and  let  {Qd,  {Qdj),  {ipj))  be  the  associated  vector  space  HBL  datum.  Then 

V{G,  {Gj),  (efj))  =  V{Zd,  {Zd>),  (4>j))  =  V{Qd,  {®d>),  tyj)). 

Now  we  prove  Proposition  3.3,  i.e.,  that  there  was  no  loss  of  generality  in  assuming  each  Sj  <  1. 

Proof  of  Proposition  3.3.  Proposition  3.3  concerns  the  case  of  Theorem  3.2,  but  in  the  course  of  its  proof  we  will 
show  that  this  result  also  applies  to  Theorems  3.6  and  3.10. 

We  first  show  the  result  concerning  (3.1),  by  considering  instead  the  set  of  inequalities  (3.8).  Suppose  that  a 
vector  space  HBL  datum  {V,  (Vj),  {(f>j))  and  s  £  [0,oo)m  satisfy  (3.8),  and  suppose  that  Sfc  >  1  for  some  k.  Define 


10 


t  G  [0,  oo)m  by  tj  =  Sj  for  j  ^  k ,  and  tk  =  1.  Pick  any  subspace  W  <  V;  let  W'  W  fl  ker(^)  and  let  U  be  a 
supplement  of  W'  in  W ,  i.e.,  W  =  U  +  W'  and  dim(W)  =  dim(C/)  +  dim(W//).  Since  dim ((j>k(U))  =  dim(17), 

m 

0  <  ^  tj  dim((j)j(U))  =  (tk  —  1)  dim([7)  +  'Y  tj  dim((j>j(U))  =  —  dim(?7)  +  'Y  tj  clim(^ ([/)); 

jjtk  j^k  j— 1 

since  s  satisfies  (3.8)  and  dim (<j>k(W'))  =  0,  by  (3.8)  applied  to  W' , 

m  m 

dim(iy')  <  Y  sj  dim (<f>j(W’))  =  Y  li  dim^LF')). 
i=i 

Combining  these  inequalities  and  noting  that  4>j(U)  is  a  supplement  of  <t>j(W')  in  <f>j(W)  for  each  j ,  we  conclude 
that  t  also  satisfies  (3.8).  Given  an  m-tuple  s  with  multiple  components  >  1,  we  can  consider  each  separately 
and  apply  the  same  reasoning  to  obtain  an  ?n-tuple  t  G  [0,  l]m  with  tj  =  min(.s?- ,  1)  for  all  j.  Our  desired  conclusion 
concerning  (3.1)  follows  from  Lemma  3.14,  by  considering  the  associated  vector  space  HBL  datum  (Qd,  (Qdj),  (ipj)) 
and  noting  that  the  lemma  was  established  without  assuming  Sj  <  1.  (A  similar  conclusion  can  be  obtained  for 
(3.5).) 

Next,  we  show  the  result  concerning  (3.2).  Consider  the  following  more  general  situation.  Let  X,  Xil . . . ,  Xrn 
be  sets  and  <pj:  X  — >  Xj  be  functions  for  j  €  {l,...,m}.  Let  s  G  [0,oo)m  with  some  Sk  >  1,  and  suppose  that 
\E\  <  n;=i  \4,j(^)\Sj  f°r  any  finite  nonempty  subset  E  C  X.  Fix  one  such  set  E.  For  each  y  G  (j>k(E),  let 
Ey  =  (j)^  1(y)  n  E,  the  preimage  of  y  under  4>k\E]  thus  \4>k(Ey)\  =  1.  By  assumption,  l^l  <  njli  \4>j(Ey)\Sj ,  so  it 
follows  that 

i^i  <  n  \uEv)\8i  <  n  \ME)\sk = n  i^(^)itfc- 

j^k  j^k  jjtk 

Since  E  can  be  written  as  the  union  of  disjoint  sets  U y^j,k(E)  we  obtain 

m 

\e\=  y  \ev\<  e  ni^^)i"=i^)i-ni^^)itfe=ni^(^)i" 

ye<l>k{E)  ye<t>k(E)  j^k  j^k  j  —  1 

Given  an  m-tuple  s  satisfying  with  multiple  components  Sk  >  1,  we  can  consider  each  separately  and  apply  the 
same  reasoning  to  obtain  an  m-tuple  t  G  [0,  l]m  with  tj  =  min(sj,  1)  for  all  j.  Our  desired  conclusion  concerning 
(3.2)  follows  by  picking  (A,  (Xj),  (<j)j))  :=  (Zd,  CZdj),  ( <t>j )),  and  a  similar  conclusion  can  be  obtained  for  (3.5)  and 
(3.10). 

We  claim  that  this  result  can  be  generalized  to  (3.4)  and  (3.9),  based  on  a  comment  in  [6,  Section  8]  that  it 
generalizes  in  the  weaker  case  of  (3.6).  □ 

To  prove  Theorem  3.10,  we  show  in  Section  3.4  that  if  (3.9)  holds  at  each  extreme  point  s  of  V ,  then  it  holds  for 
all  s  G  V .  Then  in  Section  3.5,  we  show  that  when  s  is  an  extreme  point  of  V ,  the  hypothesis  (3.8)  can  be  restated 
in  a  special  form.  Finally  in  Section  3.7  we  prove  Theorem  3.10  (with  restated  hypothesis)  when  s  is  any  extreme 
point  of  V,  thus  proving  the  theorem  for  all  s  G  V . 

3.4  Interpolation  between  extreme  points  of  V 

A  reference  for  measure  theory  is  [11].  Let  (Xj.  Aj.  jij)  be  measure  spaces  for  j  G  {1,2,...,  to},  where  each  fij  is  a 
nonnegative  measure  on  the  a  algebra  Aj .  Let  Sj  be  the  set  of  all  simple  functions  fj :  Xj  -*  C.  Thus  Sj  is  the 
set  of  all  fj :  Xj  C  which  can  be  expressed  in  the  form  cAe,  where  Cj  G  C,  Ei  G  Aj,  tlj(Ei)  <  oo,  and  the 
sum  extends  over  finitely  many  indices  i. 

Let  /  i  }  1  j |  G — t  G  be  a  multilinear  map!  i.e.,  for  any  to  tuple  f  G  n?=iCXi  where  fk  =  c0fk,o  +  Cifk,i  for 

c0,  Ci  G  C  and  /fc)0,  fk,i  G  CA'fc, 

T(f)  =  C0T(fi,  .  .  .  ,  fk-l,fk,0,  fk+l,  ■  ■  ■  ,  fm )  +  C\T(fi,  .  .  .  ,  fk-l,  fk,l,  fk+l,  ■  ■  ■  ,  fm )• 

One  multilinear  extension  of  the  Riesz-Thorin  theorem  states  the  following  (see,  e.g.,  [5]). 
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Proposition  3.15  (Multilinear  Riesz— Thorin  theorem).  Suppose  that  po  =  ( Pj,o),Pi  =  (Pj,i)  G  [l,oo]m. 
Suppose  that  there  exist  Aq,Ai  £  [0,  oo)  such  that 

m  mm 

\T{f)\  <  Aq  n  ll/jllpj,0  and  TOI^inil/.lk,  .brail  fC  J  J  S , 

j=i  j— i  j=i 

-For  eac/i  0  €  (0, 1)  define  exponents  pj  g  by 

1  _  9  1—9 

Pi, 6  Pj,  o  fb,i 

T/ien  /or  each  9  £  (0, 1), 

m  m 

\T(f)\  <  Ae0A\~e  []  H/,11^  for  all  f£  []  Sj. 

i= i  i= i 

^ere  ||/j||p  =  ||/,-||lp(.y3j^.Mj)- 

In  the  context  of  Theorem  3.10  with  vector  space  HBL  datum  (V,  (' Vj ),  (4>j)),  we  consider  the  multilinear  map 

m 

nn  -zn/^,(*)) 

J6V  j=1 


representing  the  left-hand  side  in  (3.9). 

Lemma  3.16.  If  (3.9)  holds  for  every  extreme  point  ofV,  then  it  holds  for  every  s  £  V. 

Proof.  For  any  /  £  njli'Sji  we  define  another  m-tuple  /  where  for  each  j,  fj  =  \fj\  is  a  nonnegative  simple 
function.  By  hypothesis,  the  inequality  in  (3.9)  corresponding  to  /  holds  at  every  extreme  point  s  of  V,  giving 


nf)  <  e  n  /*(&(*))  =  e  n mm*))  <  n  = n  n^iiv* 


x&v  i=i 


xev  1=1 


l=i 


l=i 


As  a  consequence  of  Proposition  3.15  (with  constants  A,:  =  1),  and  the  fact  that  any  s  £  V  is  a  finite  convex 
combination  of  the  extreme  points,  this  expression  holds  for  any  s  £  V.  For  any  nonnegative  function  Fj  (e.g., 
in  Sfibi  (Vj)),  there  is  an  increasing  sequence  of  nonnegative  simple  functions  fj  whose  (pointwise)  limit  is  Fj. 
Consider  the  ?n-tuple  F  =  (Fj)  corresponding  to  any  inequality  in  (3.9),  and  consider  a  sequence  of  ?n-tuples 
f  which  converge  to  F:  then  JlJLi  fi  a^so  converges  to  fljli  Fj.  So  by  the  monotone  convergence  theorem,  the 
summations  on  both  sides  of  the  inequality  converge  as  well.  □ 


3.5  Critical  subspaces  and  extreme  points 

Assume  a  fixed  vector  space  HBL  datum  (V,  (Vj),  ((fj)),  and  let  V  continue  to  denote  the  set  of  all  s  £  [0,  l]m  which 
satisfy  (3.8). 

Definition  3.17.  Consider  any  s  £  [0,  l]m.  A  subspace  W  <  V  satisfying  dim(lF)  =  J2jLi  si  dim(0j(W))  is  said 
to  be  a  critical  subspace;  one  satisfying  dim(lF)  <  si  d\m(<j>j(W))  is  said  to  be  subcritical;  and  a  subspace 

satisfying  dim(lF)  >  si  dim(^(W))  is  said  to  be  supercritical.  W  is  said  to  be  strictly  subcritical  ifdim(W)  < 

T,]Usi  dim(^>i (IF)). 

In  this  language,  the  conditions  (3.8)  assert  that  that  every  subspace  W  of  V,  including  {0}  and  V  itself,  is 
subcritical;  equivalently,  there  are  no  supercritical  subspaces.  When  more  than  one  ?n-tuple  s  is  under  discussion, 
we  sometimes  say  that  W  is  critical,  subcritical,  supercritical  or  strictly  subcritical  with  respect  to  s. 

The  goal  of  Section  3.5  is  to  establish  the  following: 

Proposition  3.18.  Let  s  be  an  extreme  point  of  V .  Then  some  subspace  |0)  <  W  <  V  is  critical  with  respect  to 
s,  or  s£  {0,l}m. 

Note  that  these  two  possibilities  are  not  mutually  exclusive. 

Lemma  3.19.  If  s  is  an  extreme  point  ofV,  and  if  i  is  an  index  for  which  Sj  {0, 1},  then  dim(/>j(F))  ^  0. 
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Proof.  Suppose  dim(</>j(V))  =  0.  If  t  €  [0,  l]m  satisfies  tj  =  Sj  for  all  j  ^  i,  then  YlJLi  tj  dim(^(W))  = 
J2JLi  sj  dim(cj)j(W))  for  all  subspaces  W  <  V,  so  t  £  V  as  well.  If  Sj  ^  {Oil}!  then  this  contradicts  the  as¬ 
sumption  that  s  is  an  extreme  point  of  V.  □ 

Lemma  3.20.  Let  s  be  an  extreme  point  of  V .  Suppose  that  no  subspace  {0}  <  W  <  V  is  critical  with  respect  to 
s.  Then  s  £  {0,  l}m. 

Proof.  Suppose  to  the  contrary  that  for  some  index  i,  Si  ^  {0, 1}.  If  t  €  [0,  l]171  satisfies  tj  —  Sj  for  all  j  ^  i  and  if 
ti  is  sufficiently  close  to  sl.  then  t  £  V .  This  again  contradicts  the  assumption  that  s  is  an  extreme  point.  □ 

Lemma  3.21.  Let  s  be  an  extreme  point  ofV.  Suppose  that  there  exists  no  subspace  {0}  <  W  <  V  which  is  critical 
with  respect  to  s.  Then  there  exists  at  most  one  index  i  for  which  Si  {0, 1}. 

Proof.  Suppose  to  the  contrary  that  there  were  to  exist  distinct  indices  k,l  such  that  neither  of  Sk,si  belongs  to 
{0,1}.  By  Lemma  3.19,  both  (f>k{V)  and  4n{V)  have  positive  dimensions.  For  e  G  R.  define  t  by  tj  =  Sj  for  all 

j  i  {M}, 

tk  =  sk  +  edim(0i(V))  and  t;  =  St  -  e  dim(0fc(V)). 

Whenever  |er|  is  sufficiently  small,  t  £  [0,  l\m.  Moreover,  V  remains  subcritical  with  respect  to  t.  If  |e|  is  sufficiently 
small,  then  every  subspace  {0}  <  W  <  V  remains  strictly  subcritical  with  respect  to  t,  because  the  set  of  all 
parameters  (dim(M/),  dim(0i(W)), . . . ,  dim (<j>m(W)))  which  arise,  is  finite.  Thus  t  G  V  for  all  sufficiently  small  |e|. 
Therefore  s  is  not  an  extreme  point  of  V .  □ 


Lemma  3.22.  Let  s  £  [0,  l\m.  Suppose  that  V  is  critical  with  respect  to  s.  Suppose  that  there  exists  exactly  one 
index  i  £  {1, 2, . . . ,  to}  for  which  Si  {0, 1}.  Then  V  has  a  subspace  which  is  supercritical  with  respect  to  s. 


Proof.  By  Lemma  3.19,  dim(0j(F))  >  0.  Let  K  be  the  set  of  all  indices  k  for  which  Sk  =  1.  The  hypothesis  that 
V  is  critical  means  that 

dim(V)  =  Sj  dim(^i(V))  +  ^  sk  dim(0fe(V)). 

keK 

Since  Sj  >  0  and  dim(0j(F))  >  0, 


y,  dim (<j)k{V))  =  y  Sk  dim(0fc(V))  =  dim(V)  -  Sj  dim(^j(V))  <  dim(V). 

k&K  kGK 


Consider  the  subspace  W  <  V  dehned  by 

W=  Pi  ker(>fe); 
fee  A' 

this  intersection  is  interpreted  to  be  W  =  V  if  the  index  set  K  is  empty.  W  necessarily  has  positive  dimension. 
Indeed,  W  is  the  kernel  of  the  map  tf>:  V  — >  ©feeii-  <j>k(V),  defined  by  if(x)  =  (4>k(x)  :  k  £  K),  where  0  denotes  the 
direct  sum  of  vector  spaces.  The  image  of  %f>  is  isomorphic  to  some  subspace  of  0fegA-  <t>k(V),  a  vector  space  whose 
dimension  is  strictly  less  than  dim(C).  Therefore  keifif)  =  W  has  dimension  greater  than  or 

ecjual  to  dim(F)  —  YlkeK  dim(</)fc(kr))  >  0.  Since  (fkfW)  =  {0}  f°r  ^  e 

m 

y  Sj  dim {<t>j{W))  =  Si  dim(^j(W))  +  y  dim (cj>k(W)) 

j= 1  keK 

=  Si  dim (<^j(W)) 

<  Si  dim(IF). 


Since  Sj  <  1  and  dim(W)  >  0,  Sj  dim(</>i(IF))  is  strictly  less  than  dim(IT),  whence  W  is  supercritical.  □ 

Proof  of  Proposition  3.18.  Suppose  that  there  exists  no  critical  subspace  {0}  <  W  <  V.  By  Lemma  3.20,  either 
Sj  £  {0,  l}m  -  in  which  case  the  proof  is  complete  —  or  V  is  critical.  By  Lemma  3.21,  there  can  be  at  most 
one  index  i  for  which  s,;  (f  {0, 1}.  By  Lemma  3.22,  for  critical  V,  the  existence  of  one  single  such  index  i  implies 
the  presence  of  some  supercritical  subspace,  contradicting  the  main  hypothesis  of  Proposition  3.18.  Thus  again, 
se{0,  l}m.  '  □ 
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3.6  Factorization  of  HBL  data 


Notation  3.23.  Suppose  V,V'  are  finite- dimensional  vector  spaces  over  a  field  F,  and  </>:  V  — ►  V'  is  an  ¥ -linear 
map.  When  considering  a  fixed  subspace  W  <  V,  then  <j>\w'-  W  —>  4>(W)  denotes  the  restriction  of  f  to  W,  also  a 
F -linear  map.  V /W  denotes  the  quotient  ofV  by  W;  elements  ofV/W  are  cosets  x  +  W  =  {x  +  w  :  w  £  W}  CV 
where  x  €  V.  Thus  x  +  W  =  x'  +  W  if  and  only  if  x  —  x'  £  W.  V/W  is  also  a  (finite- dimensional)  vector  space, 
under  the  definition  (x  +  W)  +  (x'  +  W)  =  (x  +  x')  +  W ;  every  subspace  of  V/W  can  be  written  as  U /W  for  some 

w  <u  <v. 

Similarly  we  can  define  V'/<fi(W),  the  quotient  of  V'  by  4>(W),  and  the  quotient  map  [</>]:  V/W  9  a;  +  If  H 
<p(x)  +  <j>(W)  £  V'/(j)(W),  also  a  ¥ -linear  map.  For  any  U/W  <  V/W,  [<f](U /W)  =  4>(U)/(j)(W). 

Let  (V.  (Vj),  (<pj))  be  a  vector  space  HBL  datum.  To  any  subspace  W  <  V  can  be  associated  two  HBL  data: 

(W, (MW)), M\w))  and  (V/W, (Vj/MW)), ([fa])) 

Lemma  3.24.  Given  the  vector  space  HBL  datum  (V,  (Vj),  (<f>j)),  for  any  subspace  W  <  V, 

V(W,  (Vj),  M\w))  n  V(V/W,  (Vj/MW)),  ([&]))  Q  W  (Vj),  (M)  (3.12) 

Proof.  Consider  any  subspace  U  <  V  and  some  s  £  [0,  l]m  such  that  both  s  £  V(W,  (Vj),((j>j\ ))  and  s  £ 
nV/W, (Vj/MW)), &>,]))■  Then 

dim(C7)  =  dim((t/  +  W)/W)  +  dim(Z7  n  W) 

m  m 

{\mv+w)iw)))+Y.  Sj  dim (<pj(U  n  W)) 

3= 1  3= 1 

m  m 

=  st  AMMU  +  w)/mw ))  +  di  MMU  n  w)) 

3= 1  3=1 

m  m 

=  si  MMMU  +  W))  ^  dim(>j(H0))  +  ^  s3  dim  (MU  n  W)) 

3=1  3=1 

m 

=  ^  Sj  (dim(<j)j(U)  +  MW))  +  dim (<fj(U  n  W))  —  dim(0J(Hr))) 

3=1 

m 

<  si  (dim(^(c/)  +  &AW))  +  dim (Mu)  n  Mw))  -  &MMW))) 

3=1 

m 

=  '52S3dim(<t>3(U))- 

3=1 

The  last  inequality  is  a  consequence  of  the  inclusions  4>j(U  GW)  C  <j>j(U)  G(/j(W).  The  last  equality  is  the  relation 
dim(A)  +  dim(i?)  =  dim(H  +  B)  +  dim(H  n  B),  which  holds  for  any  subspaces  A,  B  of  a  vector  space.  Thus  U  is 
subcritical.  □ 

Lemma  3.25.  Given  the  vector  space  HBL  datum  (V,  (Vj),  (4>j)),  let  s  £  V(V,  (Vj),  (4>j))-  Let  W  <  V  be  a  subspace 
which  is  critical  with  respect  to  s.  Then 

v(v,  (Vj),  (M)  =  v(w,  (Vj),  M\w))n  v(v/w,  (Vj/MW)),  ((M))-  (3.13) 

Proof.  With  Lemma  3.24  in  hand,  it  remains  to  show  that  V(V,  (Vj),  (<j>j))  is  contained  in  the  intersection  of  the 
other  two  polytopes. 

Any  subspace  U  <  W  is  also  a  subspace  of  V.  U  is  subcritical  with  respect  to  s  when  regarded  as  a  subspace 
of  W,  if  and  only  if  U  is  subcritical  when  regarded  as  a  subspace  of  V.  So  s  £  V(W,  (Vj),  (<t>j\w))- 

Now  consider  any  subspace  of  U/W  <  W/V\  we  have  W  <  U  <  V  and  dim  (U/W)  =  dim({7)  —  dim  (IT). 
Moreover, 

dim  ([<t>j\(U/W))  =  dxm((j)j(U)  /  <fj(W))  =  dim(</>j([/))  —  dim  (<j>j(W)). 
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Therefore  since  dim(lF)  =  J2jLi  sj  dim (4>j(W)), 


dim (U/W)  =  dim(Z7)  —  dim(W)  <  Y,  Sj  dim (<t>j(U*))  —  Y^  Sj  dim (<j>j(W)) 

i-i  i=t 

m  m 

=  E  si  (dim(^'(C7))  “  di M<Pj(W)))  =  Y  Sj  dim ([<j>j\(U /W)) 

i= i  3=1 

by  the  subcriticality  of  U,  which  holds  because  s  G  V(V,  (Vj),  ((j>j)).  Thus  any  U/W  <  V/W  is  subcritical  with 
respect  to  s,  so  s  €  V{V/W,  (Vj/<f>j(W)),  ([<f>j]))  as  well.  □ 

3.7  Proof  of  Theorem  3.10 

Recall  we  are  given  the  vector  space  HBL  datum  (V,  (Vj).  (4>j))\  we  prove  Theorem  3.10  by  induction  on  the 
dimension  of  the  ambient  vector  space  V.  If  dim(H)  =  0  then  V  has  a  single  element,  and  the  result  (3.9)  is  trivial. 

To  establish  the  inductive  step,  consider  any  extreme  point  s  of  V.  According  to  Proposition  3.18,  there  are 
two  cases  which  must  be  analyzed.  We  begin  with  the  case  in  which  there  exists  a  critical  subspace  {0}  <  W  <  V , 
which  we  prove  in  the  following  lemma.  We  assume  that  Theorem  3.10  holds  for  all  HBL  data  for  which  the  ambient 
vector  space  has  strictly  smaller  dimension  than  is  the  case  for  the  given  datum. 

Lemma  3.26.  Let  (V,  (Vj),  (4>j))  be  a  vector  space  HBL  datum,  and  let  s  €  V(V,  (Vj),  (4>j))-  Suppose  that  subspace 
{0}  <  W  <  V  is  critical  with  respect  to  s.  Then  (3.9)  holds  for  this  s. 

Proof.  Consider  any  inequality  in  (3.9).  We  may  assume  that  none  of  the  exponents  equal  zero.  For  if  Sk  =  0,  then 
fk(f>k(x))  <  ||/fc||i/sfc  for  all  x,  and  therefore 

m 

^2Y[fi(Mx))  ^  \\fk\\i/ah  ■  Y  II  fMAx))- 

x€V  j= 1  x£V  j^k 

If  II  /fc  II  i/sfc  =  0:  then  (3-9)  holds  with  both  sides  0.  Otherwise  we  divide  by  ||/fc||i/Sfe  to  conclude  that  s  € 
V(V,  (Vj),  (cj>j))  if  and  only  if  (sj)j^k  belongs  to  the  polytope  associated  to  the  HBL  datum  (V,  (Vj)j^k,  (fj)j^tk)- 
Thus  the  index  k  can  be  eliminated.  This  reduction  can  be  repeated  to  remove  all  indices  which  equal  zero. 

Let  Wj  '■=  <j>j(W).  By  Lemma  3.25,  s  G  V(W,  (Wj),  (<p:)  \  w ) ) .  Therefore  by  the  inductive  hypothesis,  one  of  the 
inequalities  in  (3.9)  is 

m  m 

Y  (3-14) 

xEW  j—1  j—1 

Define  Fj  :  Vj /Wj  —>  [0,  oo)  to  be  the  function 

Fj(x  +  Wj)=[  Y  M"  • 

yGWj 

This  quantity  is  a  function  of  the  coset  x  +  Wj  alone,  rather  than  of  x  itself,  because  for  any  2  G  Wj, 

Y  fjiy  +  ix  +  z))1^  =  Y  fj(y  +  x)1/s> 

yeWj  y£Wj 


by  virtue  of  the  substitution  y  +  z  <— >  y.  Moreover, 

\\Fj\\l/Sj=\\f3\\l/sy  (3-15) 

To  prove  this,  choose  one  element  x  G  Vj  for  each  coset  x  +  W7  G  Vj/Wj.  Denoting  by  X  the  set  of  all  these 
representatives, 

11^-11%  =  EE  f3(y  +  x)1/Si  =  E  f^1/Sj 

xeXyeWj  zGVj 

because  the  map  X  x  Wj  5  (x,  y)  x  +  y  G  Vj  is  a  bijection. 
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The  inductive  bound  (3.14)  can  be  equivalently  written  in  the  more  general  form 


En  fj(Mx  +  y))  <  n  Fjimv  +  W))  (3.16) 

xeW  j—1  j—1 

for  any  y  €  V,  by  applying  (3.14)  to  (fj)  where  fj(z)  =  fj(z  +  </>j(y)). 

Denote  by  Y  CV  a  set  of  representatives  of  the  cosets  y  +  W  €  V/W,  and  identify  V/W  with  Y.  Then 

mm  m 

y  n  fi(Mx)) = y  n fj(Mv + a;))  <  y  n m&Mv + 

xeV  j= 1  yeYxeWj=  1  y+weV/W  3  =  1 


by  (3.16). 

By  (3.15),  it  suffices  to  show  that 

E  n^a«(#+ir))<n  II -Py  II l/sj  for  functions  0  <  Fj  £  I1^Sj  {Vj/Wj).  (3-17) 

y+WeV/W  j  j 

This  is  a  set  of  inequalities  of  exactly  the  form  (3.9),  with  (V,  (V)),  (<?4))  replaced  by  (V/W,(Vj/Wj),([<f>j])).  By 
Lemma  3.25,  s  €  V(V/W ,  (Vj/Wj),  ([<fo])),  and  since  dim  (V/W)  <  dim(V),  we  conclude  directly  from  the  inductive 
hypothesis  that  (3.17)  holds,  concluding  the  proof  of  Lemma  3.26.  □ 

According  to  Proposition  3.18,  in  order  to  complete  the  proof  of  Theorem  3.10,  it  remains  only  to  analyze  the 
case  where  the  extreme  point  s  G  {0,  l}m.  Let  K  =  {k  :  sk  =  1}-  Consider  W  =  P|fcgAr  ker(^fe).  Since  W  is 
subcritical  by  hypothesis, 

m 

dim(lT)  <  ^  Sj  dim(^(lT))  =  ^  dim(0fc(lT))  =  0 
f=i  keK 

so  dim(W)  =  0,  that  is,  W  =  {0}.  Therefore  the  map  x  >->•  (<j>k(x))keK  from  V  to  the  Cartesian  product  rw  vk 
is  injective. 

For  any  x  €  V, 

m 

^  n  fkiMx ))  n  Halloo = n  n  n^n  v* 

j—1  keK  i$K  keK  i^K 

since  s,  =  0  for  all  i  ^  K.  Thus  it  suffices  to  prove  that 

y  n  fk(Mx ))  <  n  n/fcii1- 

xeVkeK  keK 

This  is  a  special  case  of  the  following  result. 

Lemma  3.27.  Let  V  be  any  finite- dimensional  vector  space  over  F.  Let  K  be  a  finite  index  set,  and  for  each 
k  €  K,  let  fk  be  an  F -linear  map  from  V  to  a  finite- dimensional  vector  space  14.  If  f ~)keK  ker(cf>k)  =  {0}  then  for 
all  functions  fk'  14  — ►  [0,  oo), 

Y  II  fk(Mx ))  <  II  Utili¬ 

ze  v  keK  keK 

Proof.  Define  $:  V  — ►  Y\k&K  14  by  $(a;)  =  (<t>k(x))keK-  The  hypothesis  f)keK  ker(^fe)  =  {0}  is  equivalent  to  $ 
being  injective.  The  product  Jlfeeif  ll/fclli  can  be  expanded  as  the  sum  of  products 

Y  II  fk(yk> 

y  keK 

where  the  sum  is  taken  over  all  y  =  (yk)keK  belonging  to  the  Cartesian  product  WkeK  14-  The  quantity  of  interest, 

Y  II  fk(Mx )), 

xev  keK 


16 


is  likewise  a  sum  of  such  products.  Each  term  of  the  latter  sum  appears  as  a  term  of  the  former  sum,  and  by  virtue 
of  the  injectivity  of  $,  appears  only  once.  Since  all  summands  are  nonnegative,  the  former  sum  is  greater  than  or 
equal  to  the  latter.  Therefore 


n  n/feii1 = Yi  n  /fcW  -  n  mmx))- 

k&K  y  keK  x£V  keK 


□ 


Having  shown  sufficiency  for  extreme  points  of  V ,  we  apply  Lemma  3.16  to  conclude  sufficiency  for  all  s  £  V . 

As  mentioned  above,  necessity  in  the  case  F  =  Q  can  be  deduced  from  necessity  in  Theorem  3.2,  by  clearing 
denominators.  First,  we  identify  V  and  Vj  with  Qd  and  and  let  E  be  any  nonempty  finite  subset  of  Qd.  Let 
( fij :  Qd  — >  Qdj  be  the  linear  map  represented  by  the  matrix  of  < f>j  multiplied  by  the  lowest  common  denominator 
of  its  entries,  i.e. ,  an  integer  matrix.  Likewise,  let  E  be  the  set  obtained  from  E  by  multiplying  each  point  by  the 
lowest  common  denominator  of  the  coordinates  of  all  points  in  E.  Then  by  linearity, 

m  m 

\E\=\E\<n\ME)\-=n\m\-. 

3  =  1  j= 1 

Recognizing  (Zd,  (Zdj),  ((f) 3 1  d))  as  an  Abelian  group  HBL  datum,  we  conclude  (3.1)  for  this  datum  from  the  converse 
of  Theorem  3.2.  According  to  Lemma  3.14,  (3.8)  holds  for  the  vector  space  HBL  datum  (Qd,(Q dj),  ((f>j))',  our 
conclusion  follows  since  &\m(<j)j(W))  =  dim (<f>j(W))  for  any  W  <  Qd. 

It  remains  to  treat  the  case  of  a  finite  field  F.  Whereas  the  above  reasoning  required  only  the  validity  of  (3.10) 
in  the  weakened  form  \E\  <  \4>j{E)\Si  f°r  some  constant  C  <  oo  independent  of  E  (see  proof  of  necessity 

for  Theorem  3.6),  now  the  assumption  that  this  holds  with  (7  =  1  becomes  essential.  Let  W  be  any  subspace  of  V . 
Since  |F|  <  oo  and  W  has  finite  dimension  over  F,  IT  is  a  finite  set  and  the  hypothesis  (3.10)  can  be  applied  with 
E  =  W.  Therefore  \W\  <  TIJli  \(t)j^W)\Sj  •  This  is  equivalent  to 


|j7|dim(W)  <-  |jp|Sj  dim(0j(lV))^ 

3= 1 

so  since  |F|  >  2,  taking  base-|F|  logarithms  of  both  sides,  we  obtain  dirn(IT)  <  sj  dim(^(IT)),  as  was  to  be 
shown.  □ 


4  Communication  lower  bounds  from  Theorem  3.2 

In  this  section  we  introduce  a  concrete  execution  model  for  programs  running  on  a  sequential  machine  with  a 
two-level  memory  hierarchy.  With  slight  modification,  our  model  also  applies  to  parallel  executions;  we  give  details 
below  in  order  to  extend  our  sequential  lower  bounds  to  the  parallel  case  in  Section  4.6.  We  assume  the  memory 
hierarchy  is  program-managed,  so  all  data  movement  between  slow  and  fast  memory  is  seen  as  explicit  instructions, 
and  computation  can  only  be  performed  on  values  in  fast  memory.  We  combine  this  with  the  geometric  model  from 
Section  2  and  the  upper  bound  from  Theorem  3.2  to  get  a  communication  lower  bound  of  the  form  H(|iJ|/MSHBL^1), 
where  \Z\  is  the  number  of  inner  loop  iterations,  represented  by  the  finite  set  Z  C  Zd .  Then  we  discuss  how  the 
lower  bounds  extend  to  programs  on  more  complicated  machines,  like  heterogeneous  parallel  systems. 

In  addition  to  the  concrete  execution  model  and  the  geometric  model,  we  use  pseudocode  in  our  examples.  At 
a  high  level,  these  three  different  algorithm  representations  are  related  as  follows: 

•  A  concrete  execution  is  a  sequence  of  instructions  executed  by  the  machine,  according  to  the  model  detailed 
in  Section  4.1.  The  concrete  execution,  unlike  either  the  geometric  model  or  pseudocode,  contains  explicit 
data  movement  operations  between  slow  and  fast  memory. 

•  The  geometric  model  (Definition  2.2),  is  the  most  abstract,  and  is  the  foundation  of  the  bounds  in  Sec¬ 
tion  3.  Each  instance  (2.2)  of  the  geometric  model  corresponds  to  a  set  of  concrete  executions,  as  detailed  in 
Section  4.2. 
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•  A  pseudocode  representation,  like  our  examples  with  (nested)  for-loops,  identifies  an  instance  (2.2)  of  the 
geometric  model.  Since  the  bounds  in  Section  3  only  depend  on  (Zd,  (Zdt),  ( <f>j )),  one  may  vary  the  iteration 
space  Z  C  Zd,  the  order  it  is  traversed,  and  the  statements  in  the  inner  loop  body  (provided  all  m  arrays  are 
still  accessed  each  iteration),  to  obtain  a  different  instance  (2.2)  of  the  geometric  model  with  the  same  bound. 
So,  when  we  prove  a  bound  for  a  program  given  as  pseudocode,  we  are  in  fact  proving  a  bound  for  a  larger 
class  of  programs. 

The  rest  of  this  section  is  organized  as  follows.  Section  4.1  describes  the  concrete  execution  model  mentioned 
above,  and  Section  4.2  relates  the  concrete  execution  model  to  the  geometric  model.  Section  4.3  states  and  proves 
the  main  communication  lower  bound  result  of  this  paper,  Theorem  4.1.  Section  4.4  presents  a  number  of  examples 
showing  why  the  assumptions  of  Theorem  4.1  are  in  fact  necessary  to  obtain  a  lower  bound.  Section  4.5  looks  at  one 
of  these  assumptions  in  more  detail  (“no  loop  splitting”),  and  shows  that  loop  splitting  can  only  improve  (reduce) 
the  lower  bound.  Finally,  Section  4.6  discusses  generalizations  of  the  lower  bound  result  to  other  machine  models. 

4.1  Concrete  execution  model 

The  hypothetical  machine  in  our  execution  model  has  a  two-level  memory  hierarchy:  a  slow  memory  of  unbounded 
capacity  and  a  fast  memory  that  can  store  M  words  (all  data  in  our  model  have  one- word  width).  Data  movement 
between  slow  and  fast  memory  is  explicitly  managed  by  software  instructions  (unlike  a  hardware  cache),  and  data 
is  copied3  at  a  one- word  granularity  (we  will  discuss  spatial  locality  in  Part  2  of  this  work).  Every  storage  location 
in  slow  and  fast  memory  (including  the  array  elements  Aj{<j>j(X)))  has  a  unique  memory  address,  called  a  variable; 
since  the  slow  and  fast  memory  address  spaces  (variable  sets)  are  disjoint,  we  will  distinguish  between  slow  memory 
variables  and  fast  memory  variables.  When  a  fast  memory  variable  represents  a  copy  of  a  slow  memory  variable  v, 
we  refer  to  v  as  a  cached  slow  memory  variable;  in  this  case,  we  assume  we  can  always  identify  the  corresponding 
fast  memory  variable  given  v,  even  if  the  copy  is  relocated  to  another  fast  memory  location. 

We  define  a  sequential  execution  E  =  (e±,  e^,  ■  ■  ■ ,  era)  as  a  sequence  of  statements  e*  of  the  following  types: 

Read  read(v ):  allocates  a  location  in  fast  memory  and  copies  variable  v  from  slow  to  fast  memory. 

Write  write{v ):  copies  variable  v  from  fast  to  slow  memory  and  deallocates  the  location  in  fast  memory. 

Compute  compute({vi, . . . ,  t>„})  is  a  statement  accessing  variables  Vi, . . . ,  vn. 

Allocate  allocate(v)  introduces  variable  v  in  fast  memory. 

Free  free(v )  removes  variable  v  from  fast  memory. 

A  sequential  execution  defines  a  total  order  on  the  statements,  and  thereby  a  natural  notion  of  when  one 
succeeds  or  precedes  another.  We  say  that  a  Read  or  Allocate  statement  and  a  subsequent  Write  or  Free 
are  paired  if  the  same  variable  v  appears  as  an  operand  in  both  and  there  are  no  intervening  Reads, 

Writes,  or  Frees  of  v.  A  sequential  execution  is  considered  to  be  well  formed  if  and  only  if 

•  operands  to  Read  (resp.,  Write)  statements  are  uncached  (resp.,  cached)  slow  memory  variables, 

•  operands  to  Allocate  statements  are  uncached  slow  memory  variables  or  fast  memory  variables4, 

•  operands  to  Free  statements  are  cached  slow  memory  variables  or  fast  memory  variables, 

•  every  Read,  Allocate,  Write,  and  Free  statement  is  paired,  and 

•  every  Compute  statement  involving  variable  v  interposes  between  paired  Read/ Allocate  and  Write/Free  state¬ 
ments  of  v,  i.e. ,  each  operand  v  in  a  Compute  statements  resides  in  fast  memory  before  and  after  the  statement. 

Essentially,  fast  memory  variables  must  be  allocated  and  deallocated,  either  implicitly  (Read/Write)  or  explicitly 
(Allocate/Free),  while  slow  memory  variables  cannot  be  allocated/deallocated.  Given  the  finite  capacity  of  fast 

4  We  will  use  the  word  ‘copied’  but  our  analysis  does  not  require  that  a  copy  remains,  e.g.,  exclusive  caches. 

4If  a  fast  memory  variable  v  in  an  allocate(v)  statement  already  stores  a  cached  slow  memory  variable,  then  we  assume  the  system 
will  first  relocate  the  cached  variable  to  an  available  location  within  fast  memory) 


statement 

statement 

Allocates, 
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memory,  we  need  an  additional  assumption  to  ensure  the  memory  operations  are  well-defined.  Given  a  well- 
formed  sequential  execution  E  =  (e\, ... ,  en),  we  define  f  ootprinti(E)  to  be  the  fast  memory  usage  after  executing 
statement  e,;  in  the  program,  i.e. , 


footprinti(E) 


f° 

I  / ootprinti-i(E)  +  1 
I  f ootprinti-\(E ) 

{  footprinti-i(E)  —  1 


*  =  0  or  i  =  n 

if  e,;  is  a  Read  or  Allocate  statement, 
if  e,;  is  a  Compute  statement,  and 
if  e,;  is  a  Write  or  Free  statement. 


Then  E  is  said  to  be  M-fit  (for  fast  memory  size  M)  if  max*  footprinti(E)  <  M. 

It  is  of  practical  interest  to  permit  variables  to  reside  in  fast  memory  before  and  after  the  execution,  e.g.,  to 
handle  parallel  code  as  described  in  the  next  paragraph;  however,  this  violates  our  notion  of  well-formedness.  Rather 
than  redefine  well-formedness  to  account  for  this  possibility,  we  take  a  simpler  approach:  given  an  execution  that  is 
well  formed  except  for  the  I  (‘input’)  and  O  (‘output’)  variables  which  reside  in  fast  memory  before  and  after  the 
execution,  we  insert  up  to  M  Reads  and  M  Writes  at  the  beginning  and  end  of  the  execution  so  that  all  memory 
statements  are  paired,  and  then  later  reduce  the  lower  bound  (on  Reads/Writes)  by  I  +  O  <  2 M. 

Although  we  will  establish  our  bounds  first  for  a  sequential  execution,  they  also  apply  to  parallel  executions 
as  explained  in  Section  4.6.  We  define  a  parallel  execution  as  a  set  of  P  sequential  executions,  {E\,E2,  ■  ■  ■  ,Ep}. 
In  our  parallel  model,  the  global  (‘slow’)  memory  for  each  processor  is  a  subset  of  the  union  of  the  local  (‘fast’) 
memories  of  the  other  processors.  That  is,  for  a  given  processor,  each  of  its  slow  memory  variables  is  really  a  fast 
memory  variable  for  some  other  processor,  and  each  of  its  fast  memory  variables  is  a  slow  memory  variable  for 
every  other  processor,  unless  it  corresponds  to  a  cached  slow  memory  variable,  in  which  case  it  is  invisible  to  the 
other  processors.  (We  could  remove  this  last  assumption  by  extending  our  model  to  distinguish  between  copies  of  a 
slow  memory  variable.)  A  parallel  execution  is  well  formed  or  (additionally)  {Mi, . . . ,  Mp }— fit  if  each  of  its  serial 
executions  Ei  is  well  formed  or  (additionally)  M— fit.  Since  well-formedness  assumes  no  variables  begin  and  end 
execution  in  fast/local  memory,  it  seems  impossible  for  there  to  be  any  nontrivial  well- formed  parallel  execution.  As 
mentioned  above,  we  can  always  allow  for  a  sequential  execution  with  this  property  by  inserting  up  to  I  +  O  <  2 M 
Reads/ Writes,  and  later  reducing  the  lower  bound  by  this  amount;  we  insert  Reads/ Writes  in  this  manner  to  each 
sequential  execution  Ej  in  a  parallel  execution. 


4.2  Relation  to  the  geometric  model 

Recall  from  Section  2  our  geometric  model  (2.2): 

for  all  I  £  Z  C  Zd,  in  some  order, 

inner  _loop(X,  (Au  . . . ,  Am),  (fa,...,  <j>m)) 

The  subroutine  inner Toop  represents  a  given  ‘computation’  involving  arrays  A\, . . . ,  Am  referenced  by  corresponding 
subscripts  fa(T), . . . ,  </>m(X);  each  Aj :  (f>j(Zd)  — »  {variables}  is  an  injection  and  each  fa:  Zd  — ►  Zdi  is  a  Z-affine 
map.  Each  array  variable  Ai(fa(X)), . . . ,  Am((f>m(T))  is  accessed  in  each  inner _loop(X),  as  an  input,  output,  or 
both,  perhaps  depending  on  the  iteration  X. 

Given  an  execution,  we  assume  we  can  discern  the  expression  Aj(fa(I))  from  any  variable  which  represents  an 
array  variable  in  the  geometric  model.  The  execution  may  contain  additional  variables  that  act  as  surrogates  (or 
copies)  of  the  variables  specified  in  the  program  text.  As  an  extreme  example,  an  execution  could  use  an  array  A'- 
as  a  surrogate  for  the  array  Aj  in  the  computations,  and  then  later  set  Aj  to  Aj.  In  such  examples,  one  can  always 
associate  each  surrogate  variable  with  the  ‘master’  copy,  and  there  is  no  loss  of  generality  in  our  analysis  to  assume 
all  variables  are  in  fact  the  master  copies. 

We  say  a  legal  sequential  execution  of  an  instance  of  the  geometric  model  is  a  sequential  execution  whose 
subsequence  of  Compute  statements  can  be  partitioned  into  contiguous  chunks  in  one-to-one  correspondence  with 
Z,  and  furthermore  all  m  array  variables  Aj(fa(X))  appear  as  operands  in  the  chunk  corresponding  to  X.  Given  a 
possibly  overlapping  partition  Z  =  (J^  Zp  a  legal  parallel  execution  is  a  parallel  execution  {£}, . . . ,  Ep}  where  each 
sequential  execution  Ei  is  legal  with  respect  to  loop  iterations  I  6  Zt.  Legality  restricts  the  set  of  possible  concrete 
executions  we  consider  (for  a  given  instance  of  the  geometric  model),  and  in  general  is  a  necessary  requirement  for 
the  lower  bound  to  hold  for  all  concrete  executions.  For  example,  transforming  the  classical  algorithm  for  matrix 
multiplication  into  Strassen’s  algorithm  (which  can  move  asymptotically  less  data)  is  illegal,  since  it  exploits  the 
distributive  property  to  interleave  computations,  and  any  resulting  execution  cannot  be  partitioned  contiguously 
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according  to  the  original  iteration  space  Z.  As  another  example,  legality  prevents  loop  splitting ,  an  optimization 
which  can  invalidate  the  lower  bound  as  discussed  in  Section  4.5. 

We  note  that  there  are  no  assumptions  about  preserving  dependencies  in  the  original  program  or  necessarily 
computing  the  correct  answer.  Restricting  the  set  of  executions  further  to  ones  that  are  correct  may  make  the  lower 
bound  unattainable,  but  does  not  invalidate  the  bound. 


4.3  Derivation  of  Communication  Lower  Bound 


Now  we  present  the  more  formal  derivation  of  the  communication  lower  bound,  which  was  sketched  in  Section  2. 
The  approach  used  here  was  introduced  in  [13]  and  generalized  in  [4].  Here  we  generalize  it  again  to  deal  with  the 
more  complicated  algorithms  considered  in  this  paper. 

In  this  work,  we  are  interested  in  the  asymptotic  communication  complexity,  in  terms  of  parameters  like  the 
fast  memory  size  M  and  the  number  of  inner  loop  body  iterations  \Z\.  We  treat  the  algorithm’s  other  parameters, 
like  the  number  of  array  references  m ,  the  dimension  of  the  iteration  space  d ,  the  array  dimensions  dj ,  and  the 
coefficients  in  subscripts  <j>j,  as  constants.  When  discussing  an  upper  bound  F  on  the  number  of  loop  iterations 
doable  with  operands  in  a  fast  memory  of  M  words,  we  assume  F  =  f 2(M),  since  to  is  a  constant.  So,  our  asymptotic 
lower  bound  may  hide  a  multiplicative  factor  c(m,  d,  {</i, . . . ,  <j)m}),  and  this  constant  will  also  depend  on  whether 
any  of  the  arrays  A\, . . . ,  Am  overlap.  (Constants  are  important  in  practice,  and  will  be  addressed  in  Part  2  of 
this  work.)  Lastly,  we  assume  d  >  0,  otherwise  \Z\  <  1  and  the  given  algorithm  is  already  trivially  communication 
optimal,  and  we  assume  to.  >  0,  otherwise  there  are  no  arrays  and  thus  no  data  movement. 

Given  an  Af-fit  legal  sequential  execution  E  =  (ei, . . . ,  e„),  we  proceed  as  follows: 


1.  Break  E  into  M -Read/Write  segments  of  consecutive  statements,  where  each  segment  (except  possibly  the  last 
one)  contains  exactly  M  Reads  and/or  Writes.  Each  segment  (except  the  last)  ends  with  the  Mth  Read/Write 
and  the  next  segment  starts  with  whatever  statement  follows.  The  last  segment  may  have  statements  other 
than  Reads/ Writes  at  the  end  to  complete  the  execution.  (We  will  simply  refer  to  these  as  Read/ Write 
segments  when  M  is  clear  from  context.) 

2.  Independently,  break  E  into  Compute  segments  of  consecutive  statements  so  that  the  Compute  statements 
within  a  segment  correspond  to  the  same  iteration  X.  (It  will  not  matter  that  this  does  not  uniquely  define  the 
first  and  last  statements  of  a  Compute  segment.)  Our  assumption  of  a  legal  execution  guarantees  that  there 
is  one  Compute  segment  per  iteration  X.  We  associate  each  Compute  segment  with  the  (unique)  Read/Write 
segment  that  contains  the  Compute  segment’s  first  Compute  statement. 

3.  Using  the  limited  availability  of  data  in  any  one  Read/Write  segment,  we  will  use  Theorem  3.2  to  establish  an 
upper  bound  F  on  the  number  of  complete  Compute  segments  that  can  be  executed  during  one  Read/ Write 
segment  (see  below).  This  is  an  upper  bound  on  the  number  of  complete  loop  iterations  that  can  be  executed. 

4.  Now,  we  can  bound  below  the  number  of  complete  Read/ Write  segments  by  \\Z\/(F  +  1)J .  We  add  1  to  F  to 
account  for  Compute  segments  that  overlap  two  (or  more)  Read/ Write  segments.  We  need  the  floor  function 
because  the  last  Read/Write  segment  may  not  contain  M  Reads/Writes.  (Since  we  are  doing  asymptotic 
analysis,  \Z\  can  often  be  replaced  by  the  total  number  of  Compute  statements.) 

5.  Finally  we  bound  below  the  total  number  of  Reads/Writes  by  the  lower  bound  on  the  number  of  complete 
Read/ Write  segments  times  the  number  of  Reads/ Writes  per  such  segment,  minus  the  number  of  Reads/ Writes 
we  inserted  to  account  for  variables  residing  in  fast  memory  before/after  the  execution: 


M 


\Z\ 

F+l 


(i  +  o)  =  n 


fM  ■  \Z\ 


-  O(M), 


(4.1) 


where  we  have  applied  our  asymptotic  assumption  F  =  fl(Af)  =  w(l). 


To  determine  an  upper  bound  F ,  we  will  use  Theorem  3.2  to  bound  the  amount  of  (useful)  computation  that 
can  be  done  given  only  0( M)  array  variables.  First,  we  discuss  how  to  ensure  that  only  a  fixed  number  of  array 
variables  is  available  during  a  single  Read/ Write  segment. 

Given  an  AT— fit  legal  sequential  execution,  consider  any  Af-Read/ Write  segment.  There  are  at  most  M  array 
variables  in  fast  memory  when  the  segment  starts,  at  most  M  array  variables  are  read/written  during  the  segment, 
and  at  most  M  array  variables  remain  in  fast  memory  when  the  segment  ends.  If  there  are  no  Allocates  of  array 
variables,  or  if  there  are  no  Frees  of  array  variables,  then  at  most  2Af  distinct  array  variables  appear  in  the  segment 
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(at  most  M  may  already  reside  in  fast  memory,  and  at  most  M  more  can  be  read  or  allocated).  More  generally, 
if  there  are  no  paired  Allocate/Free  statements  of  array  variables,  then  at  most  3 M  array  variables  appear  in  the 
segment  (at  most  M  already  reside  in  fast  memory,  at  most  M  can  be  read,  and  at  most  M  can  be  allocated). 
However,  if  we  allow  array  variables  to  be  allocated  and  subsequently  freed,  then  it  is  possible  to  have  an  unbounded 
number  of  array  variables  contributing  to  computation  in  the  same  segment;  this  can  occur  in  practice  and  we  give 
concrete  examples  in  the  following  section.  Thus,  we  need  an  additional  assumption  in  order  to  obtain  a  lower 
bound  that  is  valid  for  all  executions. 

We  will  assume  that  the  execution  contains  no  paired  Allocate/Free  statements  of  array  variables.  However,  we 
note  that  of  these  paired  statements,  we  only  need  to  avoid  the  ones  where  both  statements  occur  within  a  given 
Read/Write  segment;  e.g.,  one  could  remove  the  preceding  assumption  by  proving  that  at  least  M  Read/Write 
statements  (of  variables  besides  v)  interpose  every  paired  Allocate/Free  of  an  array  variable  v.  (This  weaker 
assumption  is  equivalent  to  an  assumption  in  [4,  Section  2]  that  there  are  no  ‘R2/D2  operands.’) 

The  communication  lower  bound  is  now  a  straightforward  application  of  Theorem  3.2. 

Theorem  4.1.  Consider  an  algorithm  in  the  geometric  model  of  (2.2)  in  Section  2,  and  consider  any  M-fit,  legal 
sequential  execution  which  contains  no  paired  Allocate/Free  statements  of  army  variables.  If  the  linear  constraints 
(3.1)  of  Theorem  3.2  are  feasible,  then  for  sufficiently  large  \Z\,  the  number  of  Reads/Writes  in  the  execution  is 
fi(|Z|/MSHBL~1),  where  shbl  is  the  minimum  value  ofY^JLisj  subject  to  (3.1). 

Proof.  The  bound  F  may  be  derived  from  Theorem  3.2  as  follows.  If  E  C  Z  is  the  (finite)  set  of  complete  inner 
loop  body  iterations  executed  during  a  Read/Write  segment,  then  we  may  bound  \E\  <  n'li  |0y (-E1) Is-7  for  any 
s  =  (si,...,sm)  satisfying  (3.1).  By  our  assumptions,  each  M-Read/ Write  segment  has  at  most  3 M  distinct 
array  variables  whose  values  reside  in  fast  memory.  This  implies  that  maxj  \<f>j(E)\  <  3 M,  since  we  allow  arrays 
to  alias5.  So,  \E\  <  rij=i(3Af)Si  =  (3M)^-lSh  Since  this  bound  applies  for  any  s  satisfying  (3.1),  we  choose 
an  s  minimizing  Ab  obtaining  the  tightest  bound  \E\  <  (3M)SHBL  =:  F.  The  communication  lower  bound 

is  f \{M  ■  \Z\/F)  —  O(M);  if  we  assume  \Z\  =  ui(F),  i.e. ,  the  iteration  space  is  ‘sufficiently  large,’  then  we  obtain 
fi(|Z|/MSHBL_1)  as  desired.  □ 

When  \Z\  =  0(F),  i.e.,  the  problem  (iteration  space)  is  not  sufficiently  large,  the  lower  bound  becomes  Q(M)  — 
O(M),  so  the  subtractive  O(M)  term  may  dominate  and  lead  to  zero  communication;  this  is  increasingly  likely  as 
the  ratio  \Z\/F  goes  to  zero.  The  parallel  case  also  demonstrates  this  behavior  in  the  ‘strong  scaling’  limit,  when  the 
problem  is  decomposed  to  the  point  that  each  processor’s  working  set  fits  in  their  local  memory  (see  Section  4.6). 
In  the  regime  \Z\  =  0(F),  a  memory-independent  lower  bound  [3]  provides  more  insight  than  the  bound  above. 
Let  W  be  the  (unknown)  number  of  Reads/Writes  performed,  and  let  I  and  O  be  the  numbers  of  input/output 
variables  residing  in  fast  memory  before/after  execution.  Then 

\Z\  <  TT  \(t>j(Z)\st  <  (max|^(Z)|)SHBL  <  (W  +  I  +  0)SHBL, 

j 

3 

and  we  have  W  >  |Z|1/SHBL  —  (I  +  O);  see  Section  7.3  for  further  discussion.  While  this  bound  applies  for  any 
\Z\  >  1,  the  memory  -dependent  bound  in  Theorem  4.1  is  tighter  when  \Z\  is  sufficiently  large;  when  \Z\  is  not  so 
large,  the  memory-independent  bound  may  be  tighter.  In  Part  1  of  this  work,  we  are  interested  in  lower  bounds  for 
large  problems,  and  so  only  consider  the  memory-dependent  bound.  In  Part  2,  we  will  revisit  the  case  of  smaller 
problems  when  we  discuss  the  constants  hidden  in  our  asymptotic  bounds. 

4.4  Examples 

We  give  four  examples  to  demonstrate  why  our  assumptions  in  Theorem  4.1  are  necessary.  Then,  we  discuss  how 
one  can  sometimes  deal  with  the  presence  of  imperfectly  nested  loops,  paired  Allocate/Free  statements,  and  an 
infeasible  linear  program  to  compute  a  useful  lower  bound;  we  give  an  example  of  this  approach. 

Example:  Modified  Matrix  Multiplication  with  paired  Allocates/FYees  (I).  The  following  simple  modi¬ 
fication  of  matrix  multiplication  demonstrates  how  paired  Allocate/Frees  can  invalidate  our  lower  bound. 

for  i\  =  1  :  N,  for  i2  =  1  :  N,  for  i3  =  1  :  N, 

C(iui2)  =  A(ii,i3)  ■  B(i3,i2) 
t  =  t  +  C(i\,i2 )7 

5If  the  arrays  do  not  alias,  then  the  tighter  constraint  \<fij(E)\  <  3 M  holds  instead,  although  our  bound  here  is  still  valid.  We 

will  discuss  tightening  our  bounds  for  nonaliasing  arrays  in  Part  2  of  this  work. 
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Suppose  we  know  the  arrays  do  not  alias  each  other.  Clearly  C  only  depends  on  the  data  when  i3  =  A,  but 
we  need  to  do  all  A 3  multiplications  to  compute  t  correctly.  But  the  same  analysis  from  Section  3  applies  to 
these  loops  as  to  matrix  multiplication,  suggesting  a  sequential  communication  lower  bound  of  Q(N3  /M1/2). 
However,  it  is  clearly  possible  to  execute  all  A3  iterations  moving  only  0(N3 /M  +  A2)  words,  by  hoisting 
the  (unblocked)  *3  loop  outside  and  blocking  the  i\  and  i2  loops  by  M/2,  doing  (M/2)2  multiplications  in  a 
Read/Write  segment  using  M/2  entries  each  of  A(-,i3)  and  B(i3,  •),  and  (over)writing  the  values  C(ii ,  i2)  to 
a  single  location  in  fast  memory,  which  is  repeatedly  Allocated  and  Freed.  Only  when  i3  =  A  would  C(ii,«2) 
actually  be  written  to  slow  memory.  So  in  this  case  there  are  a  total  of  N3  —  N2  paired  Allocates/Frees, 
corresponding  to  the  overwritten  C(i\,i2)  operands.  A 

Example:  Modified  Matrix  Multiplication  with  Paired  Allocate/Frees  (II).  This  example  also  demon¬ 
strates  how  paired  Allocate/Frees  can  invalidate  our  lower  bound.  Consider  the  following  code: 

for  ii  =  l:  A,  for  i2  =  1  :  A,  A(h,i2)  =  e2™(ii-i)(i2-i)/iv 

for  *i  =  l:  A,  for  i2  =  1  :  A,  for  i3  =  1  :  A,  C(ii,i2)  +=  A(ii,i3)  ■  B(i3,i2) 

Again,  suppose  we  know  that  the  arrays  do  not  alias  each  other.  Toward  a  lower  bound,  we  ignore  the 
initialization  of  A  (first  loop  nest)  and  only  look  at  the  second  loop  nest,  a  matrix  multiplication.  However, 
by  computing  entries  of  A  on-the-fly  from  X  and  discarding  them  (i.e. ,  Allocating/Freeing  them),  one  can 
beat  the  lower  bound  of  tt(N3 /M1/2)  words  for  matrix  multiplication.  That  is,  by  hoisting  the  (unblocked)  i2 
loop  outside  and  blocking  the  i\  and  i3  loops  by  M/2,  and  finally  writing  each  A(i\,  i3 )  to  slow  memory  when 
i2  =  A,  we  can  instead  move  0(A3/M+A2)  words.  So,  there  are  A3  — A2  possible  paired  Allocate/Frees.  A 

Example:  Infeasibility.  This  example  demonstrates  how  infeasibility  of  the  linear  constraints  (3.1)  of  Theorem  3.2 
can  invalidate  our  lower  bound.  Consider  the  following  code: 

for  ii  =  1  :  Ni,  for  i2  =  1  :  N2, 

A(i\)  =  func(A(*i)) 

It  turns  out  the  linear  constraints  (3.1)  are  infeasible,  so  we  cannot  apply  Theorem  3.2  to  find  a  bound  on 
data  reuse  of  the  form  MSHBL~1.  It  is  easy  to  see  that  only  2Ai  Reads  and  Writes  of  A  are  needed  to  execute 
the  inner  loop  N\-N2  times,  i.e.,  unbounded  (A^-fold)  data  reuse  is  possible.  In  general,  infeasibility  suggests 
that  the  number  of  available  array  variables  in  fast  memory  during  a  Read/ Write  segment  is  constrained  only 
by  the  iteration  space  Z.  (We  will  explore  infeasibility  further  in  Sections  6.2.1  and  7.1.1.) 

While  infeasibility  may  be  sufficient  for  there  to  be  an  unbounded  number  of  array  variables  in  a  Read/ Write 
segment,  the  previous  two  examples  show  that  it  is  not  necessary,  since  their  linear  programs  are  feasible.  We 
will  be  more  concrete  about  this  relationship  between  infeasibility  and  unbounded  data  reuse  in  Part  2.  A 

Example:  Loop  interleaving.  This  example  demonstrates  how  an  execution  which  interleaves  the  inner  loop 
bodies  (an  illegal  execution)  can  invalidate  our  lower  bound;  see  also  Section  4.5.  We  will  see  in  Section  4.5 
that  the  lower  bound  for  each  split  loop  is  no  larger  than  the  lower  bound  for  the  original  loop.  Consider 
splitting  the  two  lines  of  the  inner  loop  body  in  the  Complicated  Code  example  (see  Section  1)  into  two 
disjoint  loop  nests  (each  over  I  £  {1, . . . ,  A}6).  We  assume  funci  and  func2  do  not  modify  their  arguments, 
and  that  the  arrays  do  not  alias  —  the  two  lines  share  only  read  accesses  to  one  array,  A3,  so  correctness 
is  preserved.  As  will  be  seen  later  by  using  Theorem  7.1,  the  resulting  two  loop  nests  have  lower  bounds 
H(A6/M3/2)  and  H(A6/M2),  resp.,  both  better  than  the  fi(A6/M8/7)  of  the  original,  and  both  these  lower 
bounds  are  attainable.  A 

Theorem  4.1  is  enough  for  many  direct  linear  algebra  computations  such  as  (dense  or  sparse)  LU  decomposition, 
which  do  not  have  paired  Allocate/Frees,  but  not  all  algorithms  for  the  QR  decomposition  or  eigenvalue  problems, 
which  can  potentially  have  large  numbers  of  paired  Allocates/Frees  (see  [4]).  We  can  often  deal  with  interleaving 
iterations,  paired  Allocates/Frees,  and  infeasibility  of  (3.1)  by  imposing  Reads  and  Writes  [4,  Section  3.4]:  we 
modify  the  algorithm  to  add  (“impose”)  Reads  and  Writes  of  array  variables  which  are  allocated/freed  or  repeatedly 
overwritten,  apply  the  lower  bound  from  Theorem  4.1,  and  then  subtract  the  number  of  imposed  Reads  and  Writes 
to  get  the  final  lower  bound.  (Note  that  we  have  already  used  a  similar  technique,  above,  to  allow  an  execution  to 
begin/end  with  a  nonzero  fast  memory  footprint.)  We  give  an  example  of  this  approach  (see  also  [4,  Corollary  5.1]). 
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Example:  Matrix  Powering  (Part  1/4).  Consider  computing  B  =  Ak  using  the  following  code,  shown  (for 
simplicity)  for  odd  k  and  initially  B  —  A: 


/ /  Original  Code 
for  ii  =  1  :  [k/2\ 
C  =  AB 
B  —  A  -  C 


In  order  to  get  the  correct  answer,  we  assume  the  arrays  do  not  alias  each  other.  Under  our  asymptotic  assumption 
that  the  number  m  of  arrays  accessed  in  the  inner  loop  is  constant  with  respect  to  \Z\,  one  cannot  simply  model  the 
code  as  a  1-deep  loop  nest  (over  A),  since  each  entry  of  A ,  B,  C  would  be  considered  an  array.  Considering  instead 
the  scalar  multiply/adds  as  the  ‘innermost  loop,’  we  violate  an  assumption  of  the  geometric  model  (2.2):  since 
C  =  A  ■  B  needs  to  be  (partially)  completed  before  B  =  A  ■  C  can  be  computed,  the  innermost  loops  necessarily 
interleave.  Toward  obtaining  a  lower  bound,  we  could  try  to  apply  our  theory  to  a  perfectly  nested  subset  of  the 
code,  omitting  B  =  A  ■  C;  as  will  be  shown  in  part  2/4  of  this  example  (see  Section  6),  the  corresponding  linear 
constraints  in  (3.1)  are  then  infeasible,  violating  another  assumption  of  Theorem  4.1.  The  same  issue  arises  if 
we  omit  C  =  A  ■  B;  furthermore,  neither  modification  prevents  the  possibility  of  paired  Allocates/Frees  of  array 
variables  of  B,  C.  We  deal  with  all  three  violations  by  rewriting  the  algorithm  in  the  model  (2.2)  and  imposing 
Reads  and  Writes  of  all  intermediate  powers  A 1  with  1  <  *4  <  k,  so  at  most  2{k  —  2)N2  Reads  and  Writes  altogether. 
This  can  be  expressed  by  using  one  array  A  with  three  subscripts,  where  A{1,  *2,  *3)  :=  Afa,  *3)  and  all  other  entries 
are  zero-initialized. 


/ /  Modified  Code  (Imposed  Reads  and  Writes) 

for  *i  =  2:  k,  for  *2  =  1:  N,  for  *3  =  1  :  N,  for  *4  =  1  :  N, 

-4(*i,  *2,  h)  +=  -4(1,  h,  u)  ■  -4(*i  -  1,  *4,  *3) 

This  code  clearly  matches  the  geometric  model  (2.2),  and  admits  legal  executions  (which  would  be  interleaving 
executions  of  the  original  code).  As  will  be  shown  in  part  3/4  of  this  example  (see  Section  6),  the  linear  constraints 
(3.1)  are  now  feasible,  and  the  resulting  exponent  from  Theorem  4.1  will  be  shbl  =  3/2,  so  if  the  matrices  are  all 
N-by-N,  the  lower  bound  of  the  original  program  will  be  n(fc7V3/MSHBL-1)  —  2 (k  —  2)N2 .  For  simplicity,  suppose 
that  this  can  be  rewritten  as  Q(kN3  /M1'2  —  kN2).  So  we  see  that  when  the  matrices  are  small  enough  to  fit  in  fast 
memory  M,  i.e. ,  N  <  M1/2,  the  lower  bound  degenerates  to  0,  which  is  the  best  possible  lower  bound  which  is  also 
proportional  to  the  total  number  of  loop  iterations  \Z\  =  (k  —  1)1V3.  But  for  larger  N,  the  lower  bound  simplifies 
to  Q(kN3 /M1/2)  which  is  in  fact  attained  by  the  natural  algorithm  that  does  k  —  1  consecutive  (optimal)  N—by—N 
matrix  multiplications. 

This  example  also  illustrates  that  our  results  will  only  be  of  interest  for  sufficiently  large  problems,  certainly 
where  the  floor  function  in  the  lower  bound  (4.1)  is  at  least  1.  A 

The  above  approach  covers  many  but  not  all  algorithms  of  interest.  We  refer  to  reader  to  [4,  Sections  3.4  and  5] 
for  more  examples  of  imposing  Reads  and  Writes,  and  [4,  Section  4]  on  orthogonal  matrix  factorizations  for  an 
important  class  of  algorithms  where  a  subtler  analysis  is  required  to  deal  with  paired  Allocates/Frees. 

Imposing  Reads  and  Writes  may  fundamentally  alter  the  program,  so  the  lower  bound  obtained  for  the  modified 
code  need  not  apply  to  the  original  code.  In  the  example  above,  one  could  reorder6  the  original  code  to 

for  *2  =  1  :  N,  for  *3  =  1  :  N,  for  *4  =  1  :  N,  for  *4  =  1  :  [k/2\ , 

C{i 2,  *3)  +=  -4(*2,  *4)  •  B(*4,  *3) 

B{i 2,  *3)  +=  -4(*2,  *4)  •  C(u,  *3)- 

Recall  that  our  lower  bounds  are  valid  for  any  reordering  of  the  iteration  space,  correct  or  otherwise.  Since  *4  does 
not  appear  in  the  inner  loop  body,  data  movement  is  independent  of  k.  In  fact,  this  code  moves  0(N3 /M1/2) 
words,  asymptotically  beating  the  lower  bound  for  the  modified  code  (with  imposed  Reads/ Writes) .  For  another 
example,  see  [4,  Section  5.1.3].  In  Part  2,  we  will  present  an  alternative  to  imposing  Reads/Writes  which  can  yield  a 
(nontrivial)  lower  bound  on  the  original  code:  roughly  speaking,  one  ignores  the  loops  whose  indices  do  not  appear 
in  any  subscripts  (e.g.,  *4,  above).  While  this  approach  will  eliminate  infeasibility,  paired  Allocates/Frees  may  still 
arise  and  imposing  Reads/ Writes  may  still  be  necessary. 

6For  simplicity,  and  without  reducing  data  movement,  this  code  performs  (k  —  1  )N2  additional  +  operations. 
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4.5  Loop  Splitting  Can  Only  Help 

Here  we  show  that  loop  splitting  can  only  reduce  (improve)  the  communication  lower  bound  expressed  in  Theo¬ 
rem  4.1.  More  formally,  we  state  this  as  the  following. 

Theorem  4.2.  Suppose  we  have  an  algorithm  satisfying  the  hypotheses  of  Theorem  f.l,  with  lower  bound  determined 
by  the  value  shbl-  Also  suppose  that  this  algorithm  can  be  rewritten  ast  >  1  consecutive  (disjoint)  loop  nests,  where 
each  has  the  same  iteration  space  as  the  original  algorithm  but  accesses  a  subset  of  the  original  array  entries,  and 
each  satisfies  the  hypotheses  of  Theorem  f.l,  leading  to  exponents  shbl,?:  fori  €  {1, . . .  ,t}.  Then  each  shbl,i  >  shbl. 
i.e.,  the  communication  lower  bound  for  each  new  loop  nest  is  asymptotically  at  least  as  small  as  the  communication 
lower  bound  for  the  original  algorithm. 

Proof.  By  our  assumptions,  shbl  is  the  minimum  value  of  i  sj  where  s  €  [0,  l]m  satisfies  the  constraints  (3.1). 
We  need  to  prove  that  if  we  replace  these  constraints  by 

rank(tf)  <  E-i  for  all  subgroups  H  <  (4-2) 

jeLi 


where  Li  is  any  given  nonempty  subset  of  {1, ... ,  m}  corresponding  to  the  ith  (of  t)  new  loop  nest,  then  the  minimum 
value  of  shbl,j  =  sj  f°r  any  s  satisfying  (4.2)  must  satisfy  shbl,*  >  shbl-  We  proceed  by  replacing  both 

sets  of  constraints  by  a  common  finite  set,  yielding  linear  programs,  and  then  use  duality. 

As  mentioned  immediately  after  the  statement  of  Theorem  3.2,  even  though  there  are  infinitely  many  subgroups 
H  <  there  are  only  finitely  many  possible  values  of  each  rank,  so  we  may  choose  a  finite  set  S  of  subgroups 
H  <  Zd  that  yield  all  possible  constraints  (3.1).  Similarly,  there  is  a  finite  set  of  subgroups  Si  that  yields  all  possible 
constraints  (4.2).  Now  let  S*  =  SUSp  we  may  therefore  replace  both  (3.1)  and  (4.2)  by  a  finite  set  of  constraints, 
for  the  subgroups  H  G  S*  =  {Hi, . . . ,  Hg}. 

This  lets  us  rewrite  (3.1)  as  the  linear  program  of  minimizing  shbl  =  HjL i  sj  =  lm  '  s  subject  to  sT  ■  A  >  rT, 
where  s  =  (si, . . .  ,sm)T ,  lm  is  a  column  vector  of  ones,  r  =  (rank(fdi), . . . ,  rank (Hg))T ,  and  A  is  m-  by —g  with 
Ay  =  rank^^iij)).  Assuming  without  loss  of  generality  that  the  constraints  Li  are  the  first  constraints  Li  = 
{1,2,...,  |A;|},  we  may  similarly  rewrite  (4.2)  as  the  linear  program  of  minimizing  Shbl.j  =  X^=i  sj  =  Ijz,  |  '  sl, 
subject  to  si  ■  Api  >  rT,  where  A sl,  and  1|£.|  consist  of  the  first  \L.)  rows  of  A,  s  and  lm,  resp.  The  duals  of 
these  two  linear  programs,  which  have  the  same  optimal  values  as  the  original  linear  programs,  are 

maximize  rT  ■  x  subject  to  A  •  x  <  lm  (4-3) 

and 

maximize  rT  ■  Xl,  subject  to  Api  •  xl ,  <  1|l,|  (4-4) 

resp.,  where  both  x  and  xl ,  have  dimension  g.  It  is  easy  to  see  that  (4.4)  is  maximizing  the  same  quantity  as  (4.3), 
but  subject  to  a  subset  of  the  constraints  of  (4.3),  so  its  optimum  value,  shbl,*,  must  be  at  last  as  large  as  the 
other  optimum,  shbl-  □ 

While  loop  splitting  appears  to  always  be  worth  attempting,  in  practice  data  dependencies  limit  our  ability  to 
perform  this  optimization;  we  will  discuss  the  practical  aspects  of  loop  splitting  further  in  Part  2  of  this  work. 

4.6  Generalizing  the  machine  model 

Earlier  we  said  the  reader  could  think  of  a  sequential  algorithm  where  the  fast  memory  consists  of  a  cache  of  M 
words,  and  slow  memory  is  the  main  memory.  In  fact,  the  result  can  be  extended  to  the  following  situations: 

Multiple  levels  of  memory.  If  a  sequential  machine  has  a  memory  hierarchy,  i.e.,  multiple  levels  of  cache  (most 
do),  where  data  may  only  move  between  adjacent  levels,  and  arithmetic  done  only  on  the  “top”  level,  then 
it  is  of  interest  to  bound  the  data  transferred  between  every  pair  of  adjacent  levels,  say  i  and  i  +  1,  where 
i  is  higher  (faster  and  closer  to  the  arithmetic  unit)  than  i  +  1.  In  this  case  we  apply  our  model  with  M 
representing  the  total  memory  available  in  levels  1  through  i,  typically  an  increasing  function  of  i. 

Homogeneous  parallel  processors.  We  call  a  parallel  machine  homogeneous  if  it  consists  of  P  identical  proces¬ 
sors,  each  with  its  own  memory,  connected  over  some  kind  of  network.  For  any  processor,  fast  memory  is  the 
memory  it  owns,  and  Reads  and  Writes  refer  to  moving  data  over  the  network  from  or  to  the  other  processors’ 
memories.  Recalling  notation  from  above,  consider  an  {M, . . . ,  M}  fit,  legal  parallel  execution  (£{, . . . ,  Ep), 
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i.e. ,  a  set  of  M— fit,  legal  sequential  executions  Ei  (with  corresponding  Z,  C  Z);  assuming  there  are  no  paired 
Allocate/Frees,  we  can  apply  Theorem  4.1  to  each  and  obtain  the  lower  bound  of  fl(|Z,;|/MSHBL_1)  words 
moved.  As  argued  in  [3],  to  minimize  computational  costs,  necessarily  at  least  one  of  the  processors  performs 
\Z\/P  (distinct)  iterations;  since  a  lower  bound  for  this  processor  is  a  lower  bound  for  the  critical  path,  we 
take  |Z,|  =  \Z\/P,  and  obtain  the  lower  bound  of  f2(|Z|/(PMSHBL~1))  words  moved  (along  the  critical  path). 

We  recall  that  Theorem  4.1  makes  an  asymptotic  assumption  on  |Z|;  having  replaced  \Z\  by  |Z|/P  in  the 
parallel  case,  this  assumption  becomes  |Z|/P  =  When  this  assumption  fails  (e.g.,  \Z\  is  small  or 

P  is  large),  it  may  be  possible  for  each  processor  to  store  its  entire  working  set  locally,  with  no  communi¬ 
cation  needed.  As  in  the  sequential  case  (see  above),  one  may  obtain  a  memory- independent  lower  bound 
W  =  (|Z|/P)1/SHBL  —  (/ TO),  which  may  be  tighter  in  this  regime;  this  result  generalizes  [3,  Lemma  2.3].  In¬ 
terestingly,  one  can  show  that  under  certain  assumptions  on  I  and  O  (e.g.,  only  one  copy  of  the  inputs/outputs 
is  permitted  at  the  beginning/end),  while  the  communication  cost  continues  to  decrease  with  P  (up  to  the 
natural  limit  P  =  |Z|),  it  fails  to  strong-scale  perfectly  when  the  assumption  on  |Z|/P  breaks;  e.g.,  see  [3, 
Corollary  3.2].  We  will  discuss  strong  scaling  limits  further  in  Part  2  of  this  work. 

One  may  also  ask  what  value  of  M  to  use  for  each  processor.  Suppose  that  each  processor  has  Mmax  words  of 
fast  memory,  and  that  the  total  problem  size  of  all  the  array  entries  accessed  is  Marray.  So  if  each  processor 
gets  an  equal  share  of  the  data  we  use  M  =  Marray/ P  <  Mmax.  But  the  lower  bound  may  still  apply, 
and  be  smaller,  if  M  is  larger  than  Marray/P  (but  at  most  Mmax).  In  some  cases  algorithms  are  known 
that  attain  these  smaller  lower  bounds  (e.g.,  matrix  multiplication  in  [20]),  i.e.,  replicating  data  can  reduce 
communication. 

In  Section  7.3,  we  discuss  attainability  of  these  parallel  lower  bounds,  and  reducing  communication  by  repli¬ 
cating  data. 

Hierarchical  parallel  processors.  The  simplest  possible  hierarchical  machine  is  the  sequential  one  with  multiple 
levels  of  memory  discussed  above.  But  real  parallel  machines  are  similar:  each  processor  has  its  own  memory 
organized  in  a  hierarchy.  So  just  as  we  applied  our  lower  bound  to  measure  memory  traffic  between  levels  i 
and  i  T  1  of  cache  on  a  sequential  machine,  we  can  similarly  analyze  the  memory  hierarchy  on  each  processor 
in  a  parallel  machine. 

Heterogeneous  machines.  Finally,  people  are  building  heterogeneous  parallel  machines,  where  the  various  pro¬ 
cessors,  memories,  and  interconnects  can  have  different  speeds  or  sizes.  Since  minimizing  the  total  running 
time  means  minimizing  the  time  when  the  last  processor  finishes,  it  may  no  longer  make  sense  to  assign  an 
equal  fraction  |Z|/P  of  the  work  and  equal  subset  of  memory  M  to  each  processor.  Since  our  lower  bounds 
apply  to  each  processor  independently,  they  can  be  used  to  formulate  an  optimization  problem  that  will  give 
the  optimal  amount  of  work  to  assign  to  each  processor  [2], 


5  (Un)decidability  of  the  communication  lower  bound 

In  Section  3,  we  proved  Theorem  3.2,  which  tells  us  that  the  exponents  s  €  [0,  l]m  satisfy  the  inequalities  (3.1),  i.e., 

m 

rank(P)  <  Sj  i'&nk((j)j(H))  for  all  subgroups  H  <  Zd, 

3=1 

precisely  when  the  desired  bound  (3.2)  holds.  If  so,  then  following  Theorem  4.1,  the  sum  of  these  exponents 
shbl  :=  J2jLi  si  leads  to  a  communication  lower  bound  of  n(|Z|/MSHBL_1).  Since  our  goal  is  to  get  the  tightest 
bound,  we  want  to  minimize  sj  subject  to  (3.1)  and  s  £  [0,  l]m.  In  this  section,  we  will  discuss  computing 

the  set  of  inequalities  (3.1),  in  order  to  write  down  and  solve  this  minimization  problem. 

We  recall  from  Section  3.3  that  the  feasible  region  for  s  (defined  by  these  inequalities)  is  a  convex  polytope 
V  C  [0,  l]m  with  finitely  many  extreme  points.  While  V  is  uniquely  determined  by  its  extreme  points,  there  may 
be  many  sets  of  inequalities  which  specify  P;  thus,  it  suffices  to  compute  any  such  set  of  inequalities,  rather  than 
the  specific  set  (3.1).  This  distinction  is  important  in  the  following  discussion.  In  Section  5.1,  we  show  that  there 
is  an  effective  algorithm  to  determine  V .  However,  it  is  not  known  whether  it  is  decidable  to  compute  the  set  of 
inequalities  (3.1)  which  define  V  (see  Section  5.2).  In  Section  5.3,  we  discuss  two  approaches  for  approximating  V, 
providing  upper  and  lower  bounds  on  the  desired  shbl- 
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5.1  An  Algorithm  Which  Computes  V 

We  have  already  shown  in  Lemma  3.14  that  the  polytope  V  is  unchanged  when  we  embed  the  groups  7Ld  and  Zd> 
into  the  vector  spaces  Qd  and  Qdj  and  consider  the  homomorphisms  4>j  as  Q-linear  maps.  Thus  it  suffices  to 
compute  the  polytope  V  corresponding  to  the  inequalities 

m 

dim(C)  <  ^  Sj  dim((/)j(Vr))  for  all  subgroups  V  <  Qd. 

3= 1 

Indeed,  combined  with  the  constraints  s  £  [0,  l]m,  this  is  the  hypothesis  (3.8)  of  Theorem  3.10  in  the  case  F  =  Q. 

We  will  show  how  to  compute  V  in  the  case  F  =  Q;  for  the  remainder  of  this  section,  V  and  Vj  denote  finite¬ 
dimensional  vector  spaces  over  Q,  and  <f)j  denotes  a  Q-linear  map.  We  note  that  the  same  reasoning  applies  to  any 
countable  field  F,  provided  that  elements  of  F  and  the  field  operations  are  computable. 

Theorem  5.1.  There  exists  an  algorithm  which  takes  as  input  any  vector  space  HBL  datum  over  the  rationals, 
i.e.,  F  =  Q,  and  returns  as  output  both  a  list  of  finitely  many  linear  inequalities  over  Z  which  jointly  specify  the 
associated  polytope  V(V,  {Vj),  (c pj )),  and  a  list  of  all  extreme  points  ofV{V,  {Vj),  {(pj))- 

Remark  5.2.  The  algorithm  we  describe  below  to  prove  Theorem  5.1  is  highly  inefficient,  because  of  its  reliance 
on  a  search  of  arbitrary  subspaces  ofV.  A  similar  algorithm  was  sketched  in  [25]  for  computing  the  polytope  in  [6, 
Theorem  2.1],  a  result  related  to  Theorem  3.10  in  the  case  F  =  R.  but  where  the  vector  spaces  V,  Vj  are  endowed  with 
additional  inner-product  structure.  That  algorithm  searches  a  smaller  collection  of  subspaces,  namely  the  lattice 
generated  by  the  nullspaces  of  cp i, . . . ,  (pm-  In  Part  2  of  this  work  (see  Section  8)  we  will  show  that  it  suffices  to 
search  this  same  lattice  in  our  case;  this  may  significantly  improve  the  efficiency  of  our  approach.  Moreover,  this 
modification  will  also  allow  us  to  relax  the  requirement  that  F  is  countable,  although  this  is  not  necessary  for  our 
application. 

The  proof  of  Theorem  5.1  is  built  upon  several  smaller  results. 

Lemma  5.3.  There  exists  an  algorithm  which  takes  as  input  a  finite- dimensional  vector  space  V  over  Q,  and 
returns  a  list  of  its  subspaces.  More  precisely,  this  algorithm  takes  as  input  a  finite- dimensional  vector  space  V  and 
a  positive  integer  N,  and  returns  as  output  the  first  N  elements  Wi  of  a  list  (Wi,  W2, . . .)  of  all  subspaces  of  V. 
This  list  is  independent  of  N.  Each  subspace  W  is  expressed  as  a  finite  sequence  (d;w  1, . . .  ,Wd)  where  d  =  dim(lT) 
and  {w,;}  is  a  basis  for  W. 

Proof.  Generate  a  list  of  all  nonempty  subsets  of  V  having  at  most  dirn(C)  elements.  Test  each  subset  for  linear 
independence,  and  discard  all  which  fail  to  be  independent.  Output  a  list  of  those  which  remain.  □ 

We  do  not  require  this  list  to  be  free  of  redundancies. 

Lemma  5.4.  For  any  positive  integer  m,  there  exists  an  algorithm  which  takes  as  input  a  finite  set  of  linear 
inequalities  over  Z  for  s  £  [0,  l]m,  and  returns  as  output  a  list  of  all  the  extreme  points  of  the  convex  subset 
V  C  [0,1]-  specified  by  these  inequalities. 

Proof.  To  the  given  family  of  inequalities,  adjoin  the  2 m  inequalities  Sj  >  0  and  —  Sj  >  —1.  V  is  the  convex 
polytope  defined  by  all  inequalities  in  the  resulting  enlarged  family.  Express  these  inequalities  as  (s,va)  >  ca  for 
all  a  £  A,  where  A  is  a  finite  nonempty  index  set. 

An  arbitrary  point  r  £  Rm  is  an  extreme  point  of  V  if  and  only  if  (i)  there  exists  a  set  B  of  indices  a  having 
cardinality  m,  such  that  {vp  :  (3  £  B}  is  linearly  independent  and  (r,vp)  =  cp  for  all  (3  £  B,  and  (ii)  r  satisfies 
(t,  va)  >  ca  for  all  a  £  A. 

Create  a  list  of  all  subsets  B  C  A  with  cardinality  equal  to  m.  There  are  finitely  many  such  sets,  since  A  itself  is 
finite.  Delete  each  one  for  which  [vp  :  (3  £  B}  is  not  linearly  independent.  For  each  subset  B  not  deleted,  compute 
the  unique  solution  r  of  the  system  of  equations  (r,vp)  =  cp  for  all  (3  £  B.  Include  r  in  the  list  of  all  extreme 
points,  if  and  only  if  r  satisfies  (r,  va)  >  ca  for  all  a  £  A  \  B.  □ 

Proposition  5.5.  There  exists  an  algorithm  which  takes  as  input  a  vector  space  HBL  datum  {V,{Vj),{<f>j)),  an 
element  t  £  [0,  l]m,  and  a  subspace  {0}  <  W  <  V  which  is  critical  with  respect  to  t,  and  determines  whether 
t£V{v,(Vj)Mj)). 
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Theorem  5.1  and  Proposition  5.5  will  be  proved  inductively  in  tandem,  according  to  the  following  induction 
scheme.  The  proof  of  Theorem  5.1  for  HBL  data  in  which  V  has  dimension  n,  will  rely  on  Proposition  5.5  for  HBL 
data  in  which  V  has  dimension  n.  The  proof  of  Proposition  5.5  for  HBL  data  in  which  V  has  dimension  n  and  there 
are  m  subspaces  Vj,  will  rely  on  Proposition  5.5  for  HBL  data  in  which  V  has  dimension  strictly  less  than  n,  on 
Theorem  5.1  for  HBL  data  in  which  V  has  dimension  strictly  less  than  n,  and  also  on  Theorem  5.1  for  HBL  data 
in  which  V  has  dimension  n  and  the  number  of  subspaces  Vj  is  strictly  less  than  in.  Thus  there  is  no  circularity  in 
the  reasoning. 

Proof  of  Proposition  5.5.  Let  ( V ,  (Vj),  (<t>j))  and  t ,  W  be  given.  Following  Notation  3.23,  consider  the  two  HBL  data 
(W,  (fa(W)),  ( (j>j\w ))  and  ( V /W ,  (Vj/fa(W)),  ([0j])),  where  [<j)j\ :  V /W  — >  Vj/<j)j(W )  are  the  quotient  maps.  From 
a  basis  for  V,  bases  for  Vj,  a  basis  for  W,  and  corresponding  matrix  representations  of  <pj,  it  is  possible  to  compute 
the  dimensions  of,  and  bases  for,  V /W  and  Vj / 'fa(W),  via  row  operations  on  matrices.  According  to  Lemma  3.25, 
t  G  V(V,  (Vj),  (fa))  if  and  only  if  t  G  V(W,  (fa(W)),  (<j>j\w))  C  V(V/W,  (Vj/fa(W)),  ([fa])). 

Because  0  <  W  <  V,  both  IF,  V/W  have  dimensions  strictly  less  than  the  dimension  of  V.  Therefore  by 
Theorem  5.1  and  the  induction  scheme,  there  exists  an  algorithm  which  computes  both  a  finite  list  of  inequalities 
characterizing  V(W,(fa(W)),(fa\w)),  and  a  finite  list  of  inequalities  characterizing  V(V/W,(Vj/<l>j(W)),([<f>j])). 
Testing  each  of  these  inequalities  on  t  determines  whether  t  belongs  to  these  two  polytopes,  hence  whether  t 
belongs  to  V(V,  (Vj),  (<j>j))-  □ 

Lemma  5.6.  Let  HBL  datum  (V,  (Vj),  (fa))  be  given.  Let  i  G  {1,2,...,  m}.  Let  s  G  [0,  l]m  and  suppose  that  Sj  =  1. 
Let  V'  =  {x  G  V  :  <t>i(x)  =  0}  be  the  nullspace  of  fa.  Define  s'  G  [0,  l]m_1  to  be  (s i, . . . ,  sm)  with  the  ith  coordinate 
deleted.  Then  s  G  V(V,  (Vj),  (<f>j))  if  and  only  ifs£  V(V',  (Vj)j^j,  (fa\v,)jjti)- 

Proof.  For  any  subspace  W  <  V' ,  since  dim (fa(W))  =  0, 

si  dim(<£j(JF))  =  H  di M<t>j(w))- 

3  j^i 

So  if  s  G  V(V,  (Vj),  (fa))  then  s  G  V(V' ,  (Vj)j^,  (fa\v,)j&). 

Conversely,  suppose  that  s  G  V(V' ,  (Vj)j7a,  (fa\v,)j&)-  Let  W  be  any  subspace  of  V.  Write  W  =  W"  +  (WC\V') 
where  the  subspace  W"  <  V  is  a  supplement  to  W  fl  V'  in  W,  so  that  dim  (IF)  =  dim  (W")  +  dim  (IF  n  V').  Then 

s3  dim(^(!F))  =  dim (fa(W))  +  ^  s-j  dim (fa(W)) 

3  3^i 

>  dim((/>i(lF,,))  +  ^  Sj  dim (fa(W  fl  V')) 

>  dim(TF")  +  dim (W  fl  V')\ 

dim (fa(W"))  =  dim(lF")  because  fa  is  injective  on  W" .  So  s  G  V(V,  (Vj),  (fa)).  □ 

To  prepare  for  the  proof  of  Theorem  5.1,  let  P(V,(Vj),(<j)j))  be  given.  Let  (Wi,W2,W3, . . .)  be  the  list  of 
subspaces  of  V  produced  by  the  algorithm  of  Lemma  5.3.  Let  N  >  1.  To  each  index  a  G  {1,2,...,  N}  is  associated 
a  linear  inequality  J2JL i  sj  dim(^(lFQ))  >  clim(lFQ)  for  elements  s  G  [0,  l]m,  which  we  encode  by  an  (m  +  l)-tuple 
(v(Wa),c(Wa))-,  the  inequality  is  (s,u(!FQ))  >  c(Wa).  Define  Vn  C  [0,  l]m  to  be  the  polytope  dehned  by  this  set 
of  inequalities. 

Lemma  5.7. 

VN  D  V(V,  (Vj),  (fa))  for  all  N.  (5.1) 

Moreover,  there  exists  a  positive  integer  N  such  that  Vm  =  V(V,  (Vj),  (fa))  for  all  M  >  N. 

Proof.  The  inclusion  holds  for  every  N,  because  the  set  of  inequalities  defining  Vn  is  a  subset  of  the  set  defining 
V(V,(V3),(fa)). 

V(V,  (Vj),  (fa))  is  specified  by  some  finite  set  of  inequalities,  each  specified  by  some  subspace  of  V.  Choose  one 
such  subspace  for  each  of  these  inequalities.  Since  ( IFQ )  is  a  list  of  all  subspaces  of  V,  there  exists  M  such  that 
each  of  these  chosen  subspaces  belongs  to  (Wa  :  a  <  M).  □ 

Lemma  5.8.  Let  m  >2.  If  s  is  an  extreme  point  of  Vn,  then  either  Sj  G  {0, 1}  for  some  j  G  {1,2,...,  m},  or 
there  exists  a  £  {1,2,...,  N}  for  which  Wa  is  critical  with  respect  to  s  and  0  <  dim(!FQ)  <  dim(F). 
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In  the  following  argument,  we  say  that  two  inequalities  (s,  v{Wa))  >  c(Wa),  ( s,v(Wp ))  >  c(Wp)  are  distinct  if 
they  specify  different  subsets  of  Rm . 

Proof.  For  any  extreme  point  s,  equality  must  hold  in  at  least  m  genuinely  distinct  inequalities  among  those  defining 
Vn ■  These  inequalities  are  of  three  kinds:  (s,t;(WQ))  >  c(Wa)  for  a  £  {1,  2, . . . ,  N},  Sj  >  0,  and  —  Sj  >  —1,  with 
j  £  {1,2,...,  m}.  If  IV ,3  =  {0}  then  Wp  specifies  the  tautologous  inequality  fT/  Sj  ■  0  =  0,  so  that  index  /3  can  be 
disregarded. 

If  none  of  the  coordinates  Sj  are  equal  to  0  or  1,  there  must  exist  /3  such  that  equality  holds  in  at  least  two 
distinct  inequalities  ( s,v(Wp ))  >  c(Wp)  associated  to  subspaces  Wa  among  those  which  are  used  to  define  Vn-  We 
have  already  discarded  the  subspace  {0},  so  there  must  exist  (3  such  that  Wp  and  V  specify  distinct  inequalities. 
Thus  0  <  dim (Wp)  <  dim(V).  □ 

Proof  of  Theorem  5.1.  Set  V  =  V(V,  (Vj),  {</>j))-  Consider  first  the  base  case  m  =  1.  The  datum  is  a  pair  of 
finite-dimensional  vector  spaces  V,  V\  with  a  Q-linear  map  (j>:  V  — >  V\.  The  polytope  V  is  the  set  of  all  s  €  [0, 1] 
for  which  sdim(^(IT))  >  dim(W)  for  every  subspace  W  <  V.  If  clim(V')  =  0  then  V (V,  (Vj) ,  (cf>j))  =  [0,1].  If 
clim(V')  >  0  then  since  dim(0(IT))  <  dim(IT)  for  every  subspace,  the  inequality  can  only  hold  if  the  nullspace  of 
(f)  has  dimension  0,  and  then  only  for  s  =  1.  The  nullspace  of  <f>  can  be  computed.  So  V  can  be  computed  when 
m  =  1. 

Suppose  that  to  >  2.  Let  ( V ,  (Vj),  (4>j))  be  given.  Let  N  =  0.  Recursively  apply  the  following  procedure. 
Replace  N  by  N  +  1.  Consider  Vn-  Apply  Lemma  5.4  to  obtain  a  list  of  all  extreme  points  r  of  Vn,  and  for 
each  such  r  which  belongs  to  (0,  l)m,  a  nonzero  proper  subspace  W(t)  <  V  which  is  critical  with  respect  to  r. 

Examine  each  of  these  extreme  points  r,  to  determine  whether  r  £  V(V,  (Vj),  (4>j))-  There  are  three  cases. 
Firstly,  if  r  €  (0,  l)m,  then  Proposition  5.5  may  be  invoked,  using  the  critical  subspace  W(t),  to  determine  whether 
T  £  V. 

Secondly,  if  some  component  r,  of  r  equals  1,  let  V'  be  the  nullspace  of  </>j.  Set 

V  =  V(V',(Vj)j7ii,(<i>j\v,)j?i). 

According  to  Lemma  5.6,  r  £  V  if  and  only  if  t  =  (rj)jjti  £  V' .  This  polytope  V'  can  be  computed  by  the  induction 
hypothesis,  since  the  number  of  indices  j  has  been  reduced  by  one. 

Finally,  if  some  component  r,;  of  r  equals  0,  then  because  the  term  Si  dim(</)i(W))  =  0  contributes  nothing  to 
sums  sj  dim((/)j(W/)),  r  G  V  if  and  only  iff  belongs  to  V(V,  (Vj)j^i,  (<j>j  )j^i)-  To  determine  whether  f  belongs 
to  this  polytope  requires  again  only  an  application  of  the  induction  hypothesis. 

If  every  extreme  point  r  of  Vn  belongs  to  V,  then  because  Vn  is  the  convex  hull  of  its  extreme  points,  Vn  Q  V. 
The  converse  inclusion  holds  for  every  N,  so  in  this  case  Vn  =  V.  The  algorithm  halts,  and  returns  the  conclusion 
that  V  =  Vn,  along  with  information  already  computed:  a  list  of  the  inequalities  specified  by  all  the  subspaces 
IFi , . . . ,  Wn,  and  a  list  of  extreme  points  of  Vn  =  V. 

On  the  other  hand,  if  at  least  one  extreme  point  of  Vn  fails  to  belong  to  V,  then  Vn  7^  V.  Then  increment  N 
by  one,  and  repeat  the  above  steps. 

Lemma  5.7  guarantees  that  this  procedure  will  halt  after  finitely  many  steps.  □ 

5.2  On  Computation  of  the  Constraints  Defining  V 

In  order  to  compute  the  set  of  inequalities  (3.1),  we  would  like  to  answer  the  following  question:  Given  any  group 
homomorphisms  <j>  1, ... ,  4>m  and  integers  0  <  r,  rq, _ ,rm  <  d,  does  there  exist  a  subgroup  H  <  7Ld  such  that 

rank (H)  =  r,  rank(^i(F))  =  n,  . . . ,  rank(<j =  rm 


or,  in  other  words,  is 

m 

r  <  X!  si  ■  r:i 

i=i 

one  of  the  inequalities?  Based  on  the  following  result,  it  is  not  known  whether  this  problem  is  decidable  for  general 

(Z  “.(Z^  ),(&))• 

Theorem  5.9.  There  exists  an  effective  algorithm  for  computing  the  set  of  constraints  (3.1)  defining  V  if  and  only 
if  there  exists  an  effective  algorithm  to  decide  whether  a  system  of  polynomial  equations  with  rational  coefficients 
has  a  rational  solution. 
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The  conclusion  of  Theorem  5.9  is  equivalent  to  a  positive  answer  to  Hilbert’s  Tenth  Problem  for  the  rational  numbers 
Q  (see  Definition  5.14),  a  longstanding  open  problem. 

Let  us  fix  some  notation. 

Notation  5.10  (see,  e.g.,  [15]).  For  a  natural  number  d  and  ring  R,  we  write  Md{R)  to  denote  the  ring  of  d  by-d 
matrices  with  entries  from  R.  (Note  that  elsewhere  in  this  work  we  also  use  the  notation  Rmxn  to  denote  the  set  of 
m-by-n  matrices  with  entries  from  R.)  We  identify  Md(R)  with  the  endomorphism  ring  of  the  R-module  Rd  and 
thus  may  write  elements  of  Md(R)  as  R-linear  maps  rather  than  as  matrices.  Via  the  usual  coordinates,  we  may 
identify  Md{R)  with  Rd  .  We  write  R[x  i, . . .  ,xq\  to  denote  the  ring  of  polynomials  over  R  in  variables  x\,...,xq. 

Recall  we  are  given  d,  dj  £  N  and  Z-linear  maps  <f>j  :  Zd  — \  Zdj ,  for  j  £  {1,2,...,  m}  for  some  positive  integer  m. 
Without  loss,  we  may  assume  each  dj  =  d,  so  each  f>j  is  an  endomorphism  ofZd.  Each  4>j  can  also  be  interpreted 
as  a  Q-linear  map  (from  Qd  to  Qdj ),  represented  by  the  same  integer  matrix. 

Definition  5.11.  Given  m,d  £  N,  and  a  finite  sequence  r,r±, . . .  ,rm  of  natural  numbers  each  bounded  by  d,  we 
define  the  sets 

Ed;r,r!,...,rm  '■=  {(0i,  •  •  • ,  4>m)  £  {Md{Z))m  :  (3 H  <  Zd)  rank (H)  =  r  and  rank(0j(.ff))  =  rj,  1  <  j  <  m}, 
E%r,n,..,rm  '■=  •  •  •  >4>m)  e  (Md(  Q))m  :  (3V  <  Qd)  dim(P)  =  r  and  dim(0,(F))  =  rjt  1  <j<  m}. 

Remark  5.12.  The  question  of  whether  a  given  m-tuple  (0i, . . . ,  (f>m)  G  ( Md(R))m  is  a  member  of  Ed-r,r1,...,rm 
(when  R  =  Z)  or  E^.r  r  (when  R  =  Qj  is  an  instance  of  the  problem  of  whether  some  system  of  polynomial 
equations  has  a  solution  over  the  ring  R.  We  let  B  be  a  d-by^r  matrix  of  variables,  and  construct  a  system  of 
polynomial  equations  in  the  dr  unknown  entries  of  B  and  md2  known  entries  of  <fi, ... ,  0m  that  has  a  solution  if 
and  only  if  the  aforementioned  rank  (or  dimension)  conditions  are  met.  The  condition  rank(Af)  =  s  for  a  matrix 
M  is  equivalent  to  all  ( s  +  1  )-by-(s  +  1)  minors  of  M  equaling  zero  (i.e.,  the  sum  of  their  squares  equaling  zero), 
and  at  least  one  s-by-s  minor  being  nonzero  (i.e.,  the  sum  of  their  squares  not  equaling  zero  —  see  Remark  5.15). 
We  construct  two  polynomial  equations  in  this  manner  for  M  =  B  (with  s  =  r)  and  for  each  matrix  M  =  cf>jB 
(with  s  =  rj ). 

Lemma  5.13.  With  the  notation  as  in  Definition  5.11,  Ed-r,r1,...,rm  =  Ej.rri  rm  ^  (Md{Z,))m. 

Proof.  This  result  was  already  established  in  Lemma  3.14;  we  restate  it  here  using  the  present  notation.  For  the 
left-to-right  inclusion,  observe  that  if  H  <  Zd  witnesses  that  (0i, . . . ,  (j)m)  £  Ed;s,r i,...,rTO,  then  Hq  witnesses  that 
(0i, . . . ,  0m)  £  E®rtri_rm.  For  the  other  inclusion,  if  (01; . . . ,  (j>m)  £  witnessed  by  V  <  Qd , 

then  we  may  find  a  subgroup  H  =  V  D  Zd  of  Zd ,  with  rank(U)  =  clim(P).  Then  dim(0j(H))  =  rank (cj>j(H))  =  rj 
showing  that  (0 1, . . . ,  0m)  £  Ed;r,r  1,...,rm-  □ 

Definition  5.14.  Hilbert’s  Tenth  Problem  for  Q  is  the  question  of  whether  there  is  an  algorithm  which  given  a  finite 
set  of  polynomials  fi(x\, . . . ,  xq), . . . ,  fp(x i, . . . ,  xq )  €  Q[xi, . . . ,  xq]  (correctly)  determines  whether  or  not  there  is 
some  a  £  Qq  for  which  fi(a)  =  ■  ■  ■  =  fp(a)  =  0. 

Remark  5.15.  One  may  modify  the  presentation  of  Hilbert  ’s  Tenth  Problem  for  Q  in  various  ways  without  affecting 
its  truth  value.  For  example,  one  may  allow  a  condition  of  the  form  g  (a)  0  0  as  this  is  equivalent  to  (3b)(g(a)b—l  = 
0).  On  the  other  hand,  using  the  fact  that  x2  +  y2  =  0  x  =  0  =  y,  one  may  replace  the  finite  sequence  of 
polynomial  equations  with  a  single  equality  (see  also  Remark  5.18). 

Remark  5.16.  Hilbert’s  Tenth  Problem,  proper,  asks  for  an  algorithm  to  determine  solvability  in  integers  of  finite 
systems  of  equations  over  Z.  From  such  an  algorithm  one  could  positively  resolve  Hilbert’s  Tenth  Problem  for  Q. 
However,  by  the  celebrated  theorem  of  Matiyasevich-Davis-Putnam-Robinson  [17],  no  such  algorithm  exists.  The 
problem  for  the  rationals  remains  open.  The  most  natural  approach  would  be  to  reduce  from  the  problem  over  Q  to 
the  problem  over  Z,  say,  by  showing  that  Z  may  be  defined  by  an  expression  of  the  form 

a£  Z  (3  yi)  ■  ■  ■  (3  yq)P(a;  y1,...,yq)  =  0 

for  some  fixed  polynomial  P.  Konigsmann  [If]  has  shown  that  there  is  in  fact  a  universal  definition  ofZ  in  Q,  that 
is,  a  formula  of  the  form 

a  £  Z  *£=>■  (Vyr)  •  •  •  (\/yq)0(a;  yi,.. .  ,yq)  =  0 

where  9  is  a  finite  Boolean  combination  of  polynomial  equations,  but  he  also  demonstrated  that  the  existence  of  an 
existential  definition  ofZ  would  violate  the  Bombieri-Lang  conjecture.  Konigsmann’ s  result  shows  that  it  is  unlikely 
that  Hilbert’s  Tenth  Problem  for  Q  can  be  resolved  by  reducing  to  the  problem  over  Z  using  an  existential  definition 
ofZ  in  Q.  However,  it  is  conceivable  that  this  problem  could  be  resolved  without  such  a  definition. 
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Proof  of  Theorem  5.9  (necessity).  Evidently,  if  Hilbert’s  Tenth  Problem  for  Q  has  a  positive  solution,  then  there  is 
an  algorithm  to  (correctly)  determine  for  d  €  N,  r,  rq, . . . ,  rm  <  d  also  in  N,  and  (0i, . . . ,  0m)  G  {Md{ Z))m  whether 
(01  ,...,(j)m)  G  £’d;r,r1,...,rm-  By  Lemma  5.13,  (0 1,...,<j>m)  G  Ed.r>r just  in  case  (0i,...,0m)  G 
This  last  condition  leads  to  an  instance  of  Hilbert’s  Tenth  Problem  (for  Q)  for  the  set  of  rational  polynomial 
equations  given  in  Remark  5.12.  □ 

Notation  5.17.  Given  a  set  S  C  Q[xi, . . . ,  xq]  of  polynomials,  we  denote  the  set  of  rational  solutions  to  the 
equations  f  =  0  as  f  ranges  through  S  by 

V(S)( Q)  :=  {a  G  Q9  :  (V/  G  S)f(a)  =  0}. 

Remark  5.18.  Since  V(S)(Q)  is  an  algebraic  set  over  a  field,  Hilbert’s  basis  theorem  shows  that  we  may  always 
take  S  to  be  finite.  Then  by  replacing  S  with  S'  :=  {J2jesf2}’  one  sees  ^at  S  may  be  assumed  to  consist  of  a 
single  polynomial.  While  the  reduction  to  finite  S  is  relevant  to  our  argument,  the  reduction  to  a  single  equation  is 
not. 

Definition  5.19.  For  any  natural  number  t,  we  say  that  the  set  D  C  Q4  is  Diophantine  if  there  is  some  q  —  t  G  N 
and  a  set  S  C  Q[xi, ...  ,Xt.;yi, ... ,  yq-t]  for  which 

D  =  {a  G  Q4  :  {3b  G  Q^Xa;  b )  G  F(5)(Q)}. 

We  will  show  sufficiency  in  Theorem  5.9  by  establishing  a  stronger  result:  namely,  that  an  algorithm  to  decide 
membership  in  sets  of  the  form  -E0;r,r-i,...,rm  could  also  be  used  to  decide  membership  in  any  Diophantine  set. 
(Hilbert’s  Tenth  Problem  for  Q  concerns  membership  in  the  specific  Diophantine  set  V(5)(Q).) 

With  the  next  lemma,  we  use  a  standard  trick  of  replacing  composite  terms  with  single  applications  of  the  basic 
operations  to  put  a  general  Diophantine  set  in  a  standard  form  (see,  e.g.,  [24]). 

Lemma  5.20.  Given  any  finite  set  of  polynomials  S  C  Q[aq,  ...,xq\,  let  d  :=  max/6smax|=1  deg^.  (/)  and  := 
{0, 1, . . . ,  d}q .  There  is  another  set  of  variables  {wajaeD  and  another  finite  set  S'  C  Q[{uQ}ae'D]  consisting  entirely 
of  affine  polynomials  (polynomials  of  the  form  c  +  Y)  where  not  all  c  and  ca  are  zero)  and  polynomials  of  the 
form  uaup  —  rt7  with  a,  (3,  and  7  distinct,  so  that  V(S)(Q)  =  7r(y(S',)(Q))  where  ir:  Qv  -7  Q9  is  given  by 

(ua)aeV  |-t  (w(i,o,...,0))  u(0,l,0,...,0);  •  •  •  >  u(o,...,o,i))- 
Proof.  Let  T  C  Q[{ita}Qex>]  consist  of 

•  W(o,...,o)  ~  1  and 

•  ua+p  —  uaup  for  (a  +  (3)  G  T>,  a  7^  0  and  (3  ^  0. 

Define  \  '■  Q9  ~ ►  Q13  by 

(xi, . . .  ,xq)  (xa)a£T> 

where  xa  :=  JJj-iXj3.  One  sees  immediately  that  \  induces  a  bijection  y:  Qq  — >  V(T)( Q)  with  inverse  ^ | v'(r) (Q> • 
Let  S'  be  the  set  containing  T  and  the  polynomials  caua  for  which  caXa  G  S. 

One  checks  that  if  a  G  Qq,  then  y(a)  G  V(S')(Q)  if  and  only  if  a  G  Vr(S')(Q).  Applying  7 r,  and  noting  that  ir  is 
the  inverse  to  y  on  V(T)(Q),  the  result  follows.  □ 

Notation  5.21.  For  the  remainder  of  this  argument,  we  call  a  set  enjoying  the  properties  identified  for  S'  (namely 
that  each  polynomial  is  either  affine  or  of  the  form  uaup  —  u7)  a  basic  set. 

Proof  of  Theorem  5.9  (sufficiency).  It  follows  from  Remark  5.18  and  Lemma  5.20  that  the  membership  problem 
for  a  general  Diophantine  set  may  be  reduced  to  the  membership  problem  for  a  Diophantine  set  defined  by  a  finite 
basic  set  of  equations.  Let  SCQjij,...,  xq\  be  a  finite  basic  set  of  equations  and  let  t  <  q  be  some  natural  number, 
We  now  show  that  there  are  natural  numbers  p,  17  p,  pi, . . . ,  and  a  computable  function  /:  Q4  — >  (Mt/(Z)),t  so 
that  for  a  G  Q4  one  has  that  there  is  some  b  G  Q9-4  with  (a,  b)  G  R(S')(Q)  if  and  only  if  /(a)  G  Ev.PtPx%__^Pii. 

List  the  i  affine  polynomials  in  S  as 

9  q 

■^0,1  +  ^i,lXj,  •••,  Atyi  +  A  j^Xj 

i— 1  i= 1 
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and  the  k  polynomials  expressing  multiplicative  relations  in  S  as 

rp  .  rp  .  /p  .  np  .  rp  .  rp  . 

*t/fcl,lU'*2,l  *^*3,15  **•)  ^l.fc  %hl2,k 

Note  that  by  scaling,  we  may  assume  that  all  of  the  coefficients  A,j  are  integers. 

We  shall  take  p  :=  4  +  q  +  t  +  |S|,  v  :=  2q  +  2,  p  :=  2  and  the  sequence  p\,...,pp  to  consist  of  4  +  q  +  k  ones 
followed  by  t  +  £  zeros.  Let  us  describe  the  map  /:  Q*  — >  (M„(Z))^  by  expressing  each  coordinate.  For  the  sake  of 
notation,  our  coordinates  on  Q"  are  (u;  v)  :=  (uq,  U\ . . . ,  uq;  Vq,  tq, . . . ,  vq). 

A.  The  map  fi  is  constant  taking  the  value  (u;  v )  e+  (u0, 0, . . . ,  0). 

B.  The  map  /2  is  constant  taking  the  value  (w;  v )  e+  (u0,  0, . . . ,  0). 

C.  The  map  f3  is  constant  taking  the  value  (u;  v )  e+  (w0, . . . ,  uq;  0, . . . ,  0). 

D.  The  map  /4  is  constant  taking  the  value  (u;  v)  e+  (u0, . . . ,  vq;  0, . . . ,  0). 

E.  The  map  f±+j  (for  0  <  j  <  q)  is  constant  taking  the  value  (u;  v)  H >  (uq  —  Vo,  Uj  —  Vj,  0, . . . ,  0). 

F.  The  map  fi+q+j  (for  0  <  j  <  k)  is  constant  taking  the  value  (u;  v)  K >  ( u ^  .  +  Uo,  rtj3  ^  +  Uj2  .,  0, , . . ,  0). 

G.  The  map  fi+q+k+j  (for  0  <  j  <  t)  takes  a  =  ((^,...,^)  (written  in  lowest  terms)  to  the  linear  map 
(«;  v)  (pjU0  -  qjUj,  0, . . . ,  0). 

H.  The  map  f±+q+k+t+j  (for  0  <  j  <  t)  takes  the  value  (u;  v)  e- >  o  ^ i,ju o  0, . . . ,  0). 

Note  that  only  the  components  /4+(?+fc+:,  for  0  <  j  <t  actually  depend  on  (oi, . . . ,  at)  €  Q* . 

Let  us  check  that  this  construction  works.  First,  suppose  that  a  €  Q*  and  that  there  is  some  b  €  Q9_t  for  which 
(a,  b)  £  V (S)( Q).  For  the  sake  of  notation,  we  write  c  =  (ci, . . . ,  cq)  :=  (ai, . . . ,  at,  b\, . . . ,  6g_t). 

Let  V  :=  Q(l,  Ci, . . . ,  cq,  0, . . . ,  0)  +  Q(0, . . . ,  0, 1,  ci, . . . ,  cq).  We  will  check  now  that  V  witnesses  that  /(a)  € 
Note  that  a  general  element  of  V  takes  the  form  (a,  aci, . . . ,  acq,  /3,  fic\, . . . ,  (3cq)  for  (a,/3)  G  Q2. 
Throughout  the  rest  of  this  proof,  when  we  speak  of  a  general  element  of  some  image  of  V,  we  shall  write  a  and  /? 
as  variables  over  Q. 

Visibly,  dim(V)  =2  =  p. 

Clearly,  fi(a)(V)  =  Q(1,0, . . . ,  0)  =  /2(a)(V),  so  that  px  =  dim(/!(a)(V))  =  1  =  dim(/2(a)(V))  =  p2  as 
required. 

Likewise,  /3(a)(V)  =  Q(l,  ci, . . . ,  cq,  0, . . . ,  0)  =  U(a)(V),  so  that  p3  =  dim (/3(a)(V))  =  1  =  dim(/4(a)(V))  = 
PA- 

For  0  <  j  <  q  the  general  element  of  /4+j(a)(V)  has  the  form  (a  —  /3,  aCj  —  /3cy ,  0, . . . ,  0)  =  (a  —  /3)(1,  Cj,  0, . . . ,  0). 
Thus,  f4+j(a)(V)  =  Q(l,  Cj,  0, . . . ,  0)  has  dimension  p4+J-  =  1. 

For  0  <  j  <  k  we  have  Cj3  .  =  Cj3  .Cj2  the  general  element  of  /4+g+j(a)(V)  has  the  form  (acix  .  +  /3,acj3j.  + 
/3ci2  j ,  0, . . . ,  0)  =  (i aci +  (3,  (u--,.  :  +  /3ci2  j ,  0, . . . ,  0)  =  [ac^  -  +  /3)(1,  ci2  j ,  0, . . . ,  0)  so  we  have  that  ,04+9+4  = 

dim(/4+g+j(a)(V))  =  1. 

For  0  <  j  <  t,  the  general  element  of  f4+q+k+j(a)(V)  has  the  form  ( pja  —  qjacij,  0, . . . ,  0)  =  0.  That  is, 
PA+q+k+j  =  dim(/4+g+fc+j(a)(V))  =  0. 

Finally,  if  0  <  j  <£,  then  the  general  element  of  /4+<3+fc+t+j(a)(V)  has  the  form  (Ao,ja+X+=i  A ijaci,  0, . . . ,  0)  = 
0  since  X0j  +  Ya= 1  \ici  =  So  PA+q+k+t+j  =  dim (/4+g+fe+t+i(a)(V))  =  0. 

Thus,  we  have  verified  that  if  a  €  Q4  and  there  is  some  b  €  Qd_t  with  (a,  b)  €  V(5)(Q),  then  /(a)  G  Eu-Ptplt...tPlx. 
Conversely,  suppose  that  a  =  (^, . . . ,  G  Q*  and  that  f(a)  G  El/-ptp1.r_,tPtil  with  the  same  p,  u,  p1  pi, . . . ,  pp. 
Let  V  <  Q1'  have  dirn(V)  =  p  witnessing  that  /(a)  G  Ev.fptP1^.tp  . 

Lemma  5.22.  There  are  elements  g  and  h  in  V  for  which  g  =  (go,  ■  ■  ■ ,  gq\  0, . . . ,  0),  h  =  (0, . . . ,  0;  ho,  ■ . . ,  hq), 
go  =  h0  =  1,  and  Qg  +  Qh=  V. 

Proof.  The  following  is  an  implementation  of  row  reduction.  Let  the  elements  d  =  (do,  ■  ■  ■  ,dq;d'0, . . .  ,d'q)  and 
e  =  (eo,  •  •  • ,  eq;  e'0, . . . ,  e'  )  be  a  basis  for  V.  Since  dim(/1(a)(V))  =  1,  at  the  cost  of  reversing  d  and  e  and 
multiplying  by  a  scalar,  we  may  assume  that  do  =  1.  Since  dim(/3(a)(V))  =  1,  we  may  find  a  scalar  7  for  which 

('ydo,  ■  ■  ■ ,  jdg)  =  (eo, . . . ,  eq).  Set  g  :=  e  —  yd.  Write  g  =  (0, . . . ,  0,  go,  ■  ■  ■ ,  gq).  Since  g  is  linearly  independent  from 

e  and  dim(/4(a)(V))  =  1,  we  see  that  there  is  some  scalar  6  for  which  (6  go, . . . ,  Sgq)  =  (d'0, ...  ,d'q).  Set  h  :=  d  —  Sg. 

Using  the  fact  that  dim(/2(a)(U))  =  1  we  see  that  go  ^  0.  Set  g  :=  g^g.  □ 
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Lemma  5.23.  For  0  <  j  <  q  we  have  gj  =  hj. 


Proof.  We  arranged  go  =  1  =  ho  in  Lemma  5.22.  The  general  element  of  V  has  the  form  ah  +  fig  for  some 
(a,  (3)  €  Q2.  For  0  <  j  <  q,  the  general  element  of  fi+j{a){V)  has  the  form  (■ aho  —  (3ho,ahj  —  (3gj,  0, . . . ,  0)  = 
((a  —  (3)ho,  (a  —  f3)hj  +  (3{hj  —  gj),  0, . . . ,  0).  Since  ho  ^  0,  if  hj  ^  gj,  then  this  vector  space  would  have  dimension 
two,  contrary  to  the  requirement  that  /(a)  G  Eu.tPtPlt__,tPfi.  □ 

Lemma  5.24.  For  0  <  j  <  t  we  have  hj  =  aj. 

Proof.  The  image  of  ah  +  fig  under  f±+q+k+j(a)  is  (jpjOt  —  qjahj,  0, ...  ,0)  where  aj  =  ^  in  lowest  terms.  Since 
dim(/4+9+fc+:)'(a)(t/))  =  0,  we  have  qjhj  =  Pj.  That  is,  hj  =  aj.  □ 


Lemma  5.25.  For  any  F  €  S,  we  have  F[h\, . . . ,  hq)  =  0. 


Proof.  If  F  is  an  affine  polynomial,  that  is,  if  F  =  Xqj  +  ^i,jxi  f°r  some  0  <  j  <  i,  then  because 
dim(/4+g+fc_|_t+;)  (o)(y))  =  0,  we  have  Aoj  +  Y2i= i  K,jhj  =  0.  On  the  other  hand,  if  F  is  a  multiplicative  rela¬ 
tion,  that  is,  if  F  =  x q  Xi2  .  —  Xi3  .  for  some  0  <  j  <  k,  then  because  dim(/4+g+.,(a)(y))  =  1  we  see  that  there  is 
some  scalar  7  so  that  for  any  pair  (a,  (3)  we  have  7 (aAq  .  +/?)  =  ahj3  .  +  (3ht2 


have  7  =  (unless  both  are  zero,  in  which  case  the  equality  hjx  hj2  .  =  hj. 


to  obtain  =  hj2  . ,  or  hi2 

1,3 


hil,j  ^*3,; 


Specializing  to  fi  =  0  and  a  =  1,  we 
holds  anyway),  which  we  substitute 

□ 


Taking  b  :=  (ht+i,  ■  ■  ■ ,  hq)  we  see  that  (a,  b)  €  V(S)( Q). 


□ 


5.3  Computing  upper  and  lower  bounds  on  Shbl 

5.3.1  Bounding  shbl  below:  Approximating  V  by  a  superpolytope  via  sampling  constraints 

While  the  approach  in  Section  5.1  demonstrates  that  V  can  be  computed  exactly,  it  follows  from  (5.1)  that  we  can 
terminate  the  algorithm  at  any  step  N  and  obtain  a  superpolytope  Vn  2  V.  Thus,  the  minimum  sum  shbl.at  of 
s  €  Vn  is  a  lower  bound  on  the  desired  shbl-  Since  our  communication  lower  bound  is  inversely  proportional  to 
Ms hbl— 1j  sHBLjj\r  may  lead  to  an  erroneous  bound  which  is  larger  than  the  true  lower  bound.  But  the  erroneous 
bound  may  still  be  attainable,  possibly  leading  to  a  faster  algorithm. 

We  note  that,  since  shbl,jv  is  nondecreasing  in  N,  the  approximation  can  only  improve  as  we  take  more  steps 
of  the  algorithm.  The  question  remains  of  how  to  systematically  choose  a  sequence  of  subspaces  W±, . . . ,  Wn  of 
V  that  are  likely  to  yield  the  best  estimate  of  shbl,  he.,  the  fastest  growing  shbl,at-  Furthermore,  even  if  Vn  is 
a  proper  superpolytope,  it  may  still  be  that  Shbl.jv  =  shbl>  and  we  could  stop  the  algorithm  earlier.  In  Part  2 
of  this  work,  we  will  discuss  ways  to  improve  the  efficiency  of  the  approach  in  Section  5.1,  including  choosing  the 
subspaces  W,  and  detecting  this  early-termination  condition  (see  Section  8). 

5.3.2  Bounding  shbl  above:  Approximating  V  by  a  subpolytope  via  embedding  into 

The  approximation  just  presented  in  Section  5.3.1  yielded  an  underestimate  of  shbl-  Now  we  discuss  a  different 
approximation  that  yields  an  overestimate  of  shbl-  If  the  overestimate  and  underestimate  agree,  then  we  have  a 
proof  of  optimality,  and  if  they  are  close,  it  is  also  useful. 

Recall  in  Section  5.2,  we  hoped  to  answer  the  following  (possibly  undecidable)  question. 

“Given  integers  0  <  r,  n, . . . ,  rm  <  d,  is  there  a  subgroup  F[  <  Zd  such  that  rank(ff)  =  r, 
rank(^i(iJ))  =  n,  . . . ,  rank(^m(74))  =  rm?” 

If  such  an  U  exists,  we  know  the  linear  constraint  r  <  ff)'--!  SjVj  is  one  of  the  conditions  (3.1),  which  define  the 
polytope  V  of  feasible  solutions  (si, . . . ,  sm ).  We  now  ask  the  related  question, 

“. .  .is  there  a  subspace  V  <  Rd  such  that  dirn(F)  =  r,  dim(^i(R))  =  n,  . . . ,  dim((/>m(Vr))  =  rm?” 

Here,  we  reinterpret  <pj  as  an  R-linear  map.  It  turns  out  that  computing  the  set  of  inequalities  (3.8)  for  the  HBL 
datum  (Rd,  (Kdj),  (< j>j))  is  easier  than  in  the  previous  case,  in  the  sense  of  being  Tarski-decidable.  Following  the 
notation  in  Section  5.2,  we  define 

Ed-,r,r1:...,rm  :=  Wi)  •  •  • ,  4>m)  G  (Md(R))m  :  (3V  <  Rd)  dim(V)  =  r  and  dim(^-(V))  =  rj,  1  <  j  <  m}. 
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Theorem  5.26.  It  is  Tarski-decidable  [21]  to  decide  membership  in  the  sets  £7®  r  . 

Proof.  There  is  an  effective  algorithm  (the  cylindrical  algebraic  decomposition  [7])  to  decide  whether  a  system  of 
real  polynomial  equations  (e.g.,  that  in  Remark  5.12)  has  a  real  solution.  □ 

As  opposed  to  the  rational  embedding  considered  in  Section  5.2,  this  real  embedding  is  a  relaxation  of  the 
original  question.  That  is,  if  H  <  Zd  exists,  then  V,  the  real  vector  subspace  of  generated  by  H ,  exists 
with  dim(F)  =  rank(77);  however,  there  may  be  V  which  do  not  correspond  to  any  H  <  Zd.  In  other  words, 
the  existence  of  V  is  a  necessary  but  not  sufficient  condition  for  the  existence  of  such  a  subgroup  H  <  Zd.  Since 
dim(0j(V))  =  rank (</>j(H)),  we  see  that  the  set  of  inequalities  (3.8)  for  the  vector  space  HBL  datum  (Rd,  (Rdj),  (4>j)) 
is  a  superset  of  the  inequalities  (3.1)  for  the  Abelian  group  HBL  datum  (Zd,  (Zdj),  (or  equivalently,  for  the 
rational  embedding  considered  in  Section  5.2).  In  terms  of  the  polytopes  defined  by  these  inequalities,  we  have 

V  :=P(Md,(RdO,(0i))  C  V, 

i.e. ,  the  constraints  for  the  relaxed  problem  give  a  polytope  V  of  feasible  solutions  s  =  (si, . . . ,  sm),  which  is 
a  subpolytope  (subset)  of  V.  If  the  relaxed  constraints  are  a  proper  superset,  then  P  is  a  proper  subpolytope. 
This  means  that  the  conclusion  (3.2)  of  Theorem  3.2  is  still  valid  for  any  s  €  V,  but  the  upper  bound  may  be 
too  large.  Therefore,  the  communication  lower  bound  that  is  inversely  proportional  to  MSHBL_1,  where  Shbl  := 
minsgp  Sj ,  is  also  still  valid.  However,  it  may  be  smaller  than  the  lower  bound  that  is  inversely  proportional 

to  MSHBL_1,  where  sHbl  :=  nhns6p  J2jLi  Sj- 

In  other  words,  it  is  Tarski-decidable  to  write  down  a  communication  lower  bound,  but  it  may  be  strictly  smaller 
than  the  best  lower  bound  implied  by  Theorem  3.2,  and  obtained  by  the  approach  in  Section  5.1. 

6  Easily  computable  communication  lower  bounds 

As  mentioned  in  Section  5,  there  may  be  many  equivalent  linear  programs  to  determine  shbl  that  may  not  include 
the  set  of  inequalities  (3.1).  The  approach  in  Section  5.1  shows  that  it  is  always  possible  to  write  down  one  such 
linear  program.  In  this  section,  we  discuss  three  practical  cases  where  we  can  write  down  another  (equivalent)  linear 
program  without  applying  the  approach  in  Section  5.1. 

First,  in  Section  6.1,  we  demonstrate  some  transformations  in  that  let  us  simplify  the  input  HBL  datum 
(Zd,  (Z dj),{4>j))  without  altering  Shbl-  Then  in  Section  6.2  we  examine  two  trivial  cases  where  we  can  imme¬ 
diately  write  down  an  equivalent  linear  program.  In  the  first  case  (Section  6.2.1),  we  show  that  we  can  always 
immediately  recognize  when  the  feasible  region  is  empty;  therefore  no  shbl  exists  and  arbitrary  data  reuse  is  pos¬ 
sible.  In  the  second  case  (Section  6.2.2),  when  one  or  more  <f>j  is  an  injection,  we  can  immediately  determine  that 
Shbl  =  1>  that  is,  no  asymptotic  data  reuse  is  possible.  Lastly,  in  Section  6.3,  we  present  the  most  general  case 
solved  explicitly  in  this  paper,  when  the  subscripts  of  each  array  are  a  subset  of  the  loop  indices.  We  apply  this  to 
several  examples:  the  direct  IV-body  problem,  database  join,  matrix-vector  multiplication,  and  tensor  contractions, 
as  well  as  three  examples  seen  earlier:  matrix-matrix  multiplication,  “complicated  code”  (from  Section  1),  and 
computing  matrix  powers  (from  Section  4). 

There  are  variety  of  other  situations  in  which  we  can  effectively  compute  the  communication  lower  bound, 
without  applying  the  approach  in  Section  5.1.  These  are  summarized  in  Section  8,  and  will  appear  in  detail  in 
Part  2  of  this  paper. 

6.1  Simplifying  the  linear  constraints  of  (3.1) 

If  ra,nk((f>j(Zd))  =  0  (e.g.,  array  Aj  is  a  scalar),  we  say  subscript  <j>j  is  a  trivial  map,  since  it  maps  all  indices  I  to 
the  same  subscript.  The  following  remark  shows  that  we  can  ignore  trivial  maps  when  computing  a  communication 
lower  bound,  without  affecting  shbl- 

Remark  6.1.  Suppose  we  are  given  the  HBL  datum  (Zd,  (Zdj),  (< f>j ))  which  satisfies  rank(^ (Zd))  =  0  for  indices 
k  <  j  <  to.  Then  (3.1)  becomes 

k 

rank(lt)  <  Sj  rank for  all  subgroups  H  <  Zd. 
i= i 

So  given  feasible  exponents  (sj)j=1  for  the  HBL  datum  (Zd,  (Zdj )*=1,  (4>j)j=i),  we  are  free  to  pick  Sj  £  [0,1]  for 
k  <  j  <  to.  Towards  minimizing  shbl?  we  would,  of  course,  pick  Sj  =  0  for  these  indices.  So,  we  ignore  trivial 
maps  when  computing  a  communication  lower  bound. 
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It  is  also  possible  that  At  and  Aj  have  the  same  subscript  fa  =  fa.  The  following  remark  shows  that,  as  with 
trivial  maps,  we  can  ignore  such  duplicate  subscripts,  without  affecting  shbl- 

Remark  6.2.  Given  the  HBL  datum  (T,d,  (Zdj),  (fa)),  suppose  ker (fa)  =  •  •  •  =  ker (<j>m).  Then  fa(H)  =  fa(H)  for 
any  k  <i,j<  m.  Then  the  constraint  (3.1)  becomes 

k— 1  m  k 

rank  (If)  <  Sj  vank(<f)j(H))  +  rank (fa(H))  E*  =  E  tj  rank ((f>j(H))  for  all  subgroups  H  <  hd . 

3  = 1  i—k  3— 1 

So  given  feasible  exponents  t  =  (tj)k=1  for  the  HBL  datum  (Zd,(Zdi)j=1,(fa)j=i),  we  have  a  family  of  feasible 
exponents  {s  =  (sj)JL1  £  [0,  l]m}  for  the  original  datum,  satisfying  Sj  =  tj  for  1  <  j  <  k  and  Sk  +  ■  •  ■  +  sm  =  tk- 
Our  objective  shbl  only  depends  on  the  sum  of  the  exponents  (i.e.,  computing  t  suffices);  thus  we  ignore  duplicate 
maps  when  computing  a  communication  lower  bound. 

So  in  the  remainder  of  this  section,  we  assume  that  the  group  homomorphisms  fa, ... ,  (f>m  are  distinct  and 
nontrivial.  The  arrays  Aj  however,  may  overlap  in  memory  addresses;  this  possibility  does  not  affect  our  asymptotic 
lower  bound,  given  our  assumption  from  Section  4  that  m  is  a  constant,  negligible  compared  to  M. 

6.2  Trivial  Cases 

There  are  a  few  trivial  cases  where  no  work  is  required  to  find  the  communication  lower  bound:  when  the  homomor¬ 
phisms’  kernels  have  a  nontrivial  intersection,  and  when  at  least  one  homomorphism  is  an  injection.  These  cover 
the  cases  where  d  =  1  or  m  =  1.  (When  m  =  0  there  are  no  arrays,  and  when  d  =  0  there  are  no  loops,  cases  we 
ignore.) 

The  following  lemma  shows  that  shbl  =  0  only  arises  in  the  trivial  d  =  0  case;  otherwise,  when  the  constraints 
(3.1)  are  feasible  and  shbl  exists,  then  shbl  >  1. 

Lemma  6.3.  If  d  >  0  and  s  £  [0,  l]m  satisfies  (3.1),  then  J2jLi  sj  —  1- 

Proof.  If  not,  then  any  nontrivial  subgroup  is  supercritical  with  respect  to  s.  □ 

6.2.1  Infeasible  Case 

The  following  result,  a  companion  to  Theorem  3.2,  gives  a  simpler  necessary  and  sufficient  condition  for  there  to 
exist  some  exponent  s  which  satisfies  (3.2). 

Lemma  6.4.  There  exists  s  £  [0, l]m  for  which  (3.2)  holds  if  and  only  if 

m 

n  ker(<A?) = {°}-  (e-1) 

1=1 

Proof.  If  n,ker(</>j)  =  {0},  then  Lemma  3.27  asserts  that  (3.2)  holds  with  all  Sj  equal  to  1.  Conversely,  if  H  = 
C  .  ker(</>j)  is  nontrivial,  then  for  any  finite  nonempty  subset  E  C  H ,  4>j(E)  =  {0}  for  every  index  j,  so  \f>j(E)\  =  1. 
Thus  the  inequality  (3.2)  fails  to  hold  for  any  E  C  H  of  cardinality  >2.  □ 

So,  if  f)/  ker  (fa)  ^  {0},  then  there  is  no  shbl,  and  there  are  arbitrarily  large  sets  E  that  access  at  most  m  <C  M 
operands;  if  operands  may  reside  in  cache  before/after  the  program’s  execution,  then  a  communication  lower  bound 
of  0  is  attainable  (see  Section  7.1.1). 

If  m  =  1  or  d  =  1,  then  either  we  fall  into  this  case,  or  that  of  Section  6.2.2. 

Example:  Matrix  Powering  (Part  2/4).  Recall  the  ‘Original  Code’  for  computing  B  =  Ak  (for  odd  k)  from 
part  1/4  of  this  example.  The  lines  C  =  A  ■  B  and  B  =  A  ■  C  are  matrix  multiplications,  i.e.,  each  is  3  nested 
loops  (over  *2,  «3,  u).  As  discussed  in  part  1/4,  we  ignore  the  second  matrix  multiplication,  leaving  only  C  =  A  -  B. 
This  yields  4>a(h,  h,  h,  u)  =  (i2,u),  4>b(h,  *2,  *3,  k)  =  (h,U),  and  <fc(fa,  *2,  fa,  fa)  =  (hAz)-  Observe  that  the 
subgroup  ker(^)  n  ker (4>b)  O  kei/^c1)  =  (i2  =  *3  =  i\  =  0)  <  Z4  is  nontrivial,  so  by  Lemma  6.4,  V  is  empty  and  no 
shbl  exists.  This  tells  us  that  there  is  no  lower  bound  on  communication  that  is  proportional  to  the  number  of  loop 
iterations  \_k/2\N3.  Indeed,  if  one  considers  an  execution  where  the  i\  loop  is  made  the  innermost,  then  we  can 
increase  k  arbitrarily  without  additional  communication.  (Recall  that  our  bounds  ignore  data  dependencies,  and  do 
not  require  correctness.)  So  to  derive  a  lower  bound  and  corresponding  optimal  algorithm,  we  apply  the  approach 
of  imposing  Reads  and  Writes  from  Section  4.  We  will  continue  this  example  in  Section  6.3,  after  introducing 
Theorem  6.6.  A 
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6.2.2  Injective  Case 


If  one  of  the  array  references  is  an  injection,  then  each  loop  iteration  will  require  (at  least)  one  unique  operand,  so 
at  most  M  iterations  are  possible  with  M  operands  in  cache.  This  observation  is  confirmed  by  the  following  lemma. 

Lemma  6.5.  If  4>k  is  injective,  then  the  minimal  shbl  —  1- 

Proof.  We  see  that  s  :=  ek  £  V  by  rewriting  (3.1)  as 

1  <  Sfc  +  E  Sj  rank(^>j(U))/rank(U)  for  all  subgroups  H  <  Zd. 

The  sum  shbl  =  1  is  minimal  due  to  Lemma  6.3.  □ 

As  anticipated,  the  argument  in  Section  4  gives  a  lower  bound  f l(M  ■  \Z\/F)  =  fi(|Z|),  that  is,  no  (asymptotic) 
data  reuse  is  possible. 

Example:  Matrix- Vector  Multiplication  (Part  1/3).  The  code  is 

for  i\  =  1  :  N,  for  i2  =  1  :  N, 
y(h)  +=  A(i i,i2)  •  x{i2) 

We  get  (j>y(ii,i2)  =  («i),  0a(*i,*2)  =  (*i, *2),  and  (j>x{h,i2)  =  (i2).  Clearly,  Tank((fA(Z2))  =  2,  i.e.,  f>A  is  an 
injection,  so  by  Lemma  6.5  we  have  shbl  =  1.  Each  iteration  1  =  (ii,i2)  requires  a  unique  entry  A(i\,i2),  so  at 
most  A fSHBL  =  M  iterations  are  possible  with  M  operands.  In  part  2/3  of  this  example  (in  Section  6.3,  below)  we 
arrive  at  the  lower  bound  shbl  =  1  in  a  different  manner  (via  Theorem  6.6).  A 

6.3  Product  Case 

Recall  the  matrix  multiplication  example,  which  iterates  over  T?  indexed  by  three  loop  indices  (ii,  i2,i 3),  with  inner 
loop  C(ii,i2)  +=  A(ii,i3)  '  B(ia,i2),  which  gives  rise  to  3  homomorphisms 


1U2U3)  —  (A,  *3))  fe(ii)»2i®3)  —  (*3)*2)>  <t>c(iiA  2, *3)  —  (*1;J2), 

all  of  which  simply  choose  subsets  of  the  loop  indices  i\,i2,i3.  Similarly,  for  other  linear  algebra  algorithms,  tensor 
contractions,  direct  V-body  simulations,  and  other  examples  discussed  below,  the  <pj  simply  choose  subsets  of  loop 
indices.  In  this  case,  one  can  straightforwardly  write  down  a  simple  linear  program  for  the  optimal  exponents  of 
Theorem  3.2.  We  present  this  result  as  Theorem  6.6  below  (this  is  a  special  case,  with  a  simpler  proof,  of  the  more 
general  result  in  [6,  Proposition  7.1]).  Using  Theorem  6.6  we  also  give  a  number  of  concrete  examples,  some  known 
(e.g.,  linear  algebra)  and  some  new. 

We  use  the  following  notation.  Zd  will  be  our  d-dinrensional  set  including  all  possible  tuples  of  loop  indices  I  = 
(ii, . . . ,  id).  The  homomorphisms  </>i, . . . ,  <f>m  all  choose  subsets  of  these  indices  Si, ,  Sm,  i.e.,  each  Sj  C  {1, . . . ,  d} 
indicates  which  indices  </>j  chooses.  If  an  array  reference  uses  multiple  copies  of  a  subscript,  e.g.,  A3{i\,  ii,  i2),  then 
the  corresponding  S3  =  {1,2}  has  just  one  copy  of  each  subscript. 

Theorem  6.6.  Suppose  each  4>j  chooses  a  subset  Sj  of  the  loop  indices,  as  described  above.  Let  Hi  <  T,d  denote 
the  subgroup  Hi  =  (ef),  i.e.,  nonzero  only  in  the  ith  coordinate,  for  i  £  {1, . . .  ,d}.  Then ,  for  any  Sj  £  [0, 1], 

m 

rank(f7j)  <  Sj  rank(^(i?,))  for  all  i  £  {1,2, ...  ,d}  (6.2) 

j= 1 

if  and  only  if 

m 

rank (H)  <  E  Sj  rank (cj)j(H))  for  all  subgroups  H  <  Zd.  (3-1) 

i= 1 

Proof.  Necessity  follows  from  the  fact  that  the  constraints  (6.2)  are  a  subset  of  (3.1). 

To  show  sufficiency,  we  rewrite  hypothesis  (6.2)  as 

m 

rank(Uj)  =  1  <  Sj5(j,  i)  for  all  i  £  {1, 2, . . . ,  d}, 

3= 1 
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where  5(j,i)  =  1  if  i  £  Sj,  and  S(j,i)  =  0  otherwise.  Let  H  <Zd  have  rank(ff)  =  h.  Express  H  as  the  set  spanned 
by  integer  linear  combinations  of  columns  of  some  rank  h  integer  matrix  Mh  of  dimension  d-  by —h.  Pick  some  full 
rank  h  by -h  submatrix  M'  of  Mh-  Suppose  M'  lies  in  rows  of  Mh  in  the  set  1Z  C  {1, . . . ,  d};  note  that  \R]  =  h. 
Then 

rank (<pj(H))  >  E  5(3, r), 

ren 

so 

mm  m 

ESJ  rank((/)J (H))  >  E  E  <5C?’r)  =  EEs^’r)  -  E  ™nk(Hr)  =  E  1  =  1^1  =  rank(ff). 

j= 1  j= 1  r£lZ  r£lZ  j=l  r£lZ  r£TZ 

□ 


Remark  6.7.  We  can  generalize  this  to  the  case  where  the  <pj  do  not  choose  subsets  Sj  of  the  indices  i\, . . . ,  id,  but 
rather  subsets  of  suitable  independent  linear  combinations  of  the  indices,  for  example  subsets  of  {i i,  i\  +  in,  ij,  —  2?'i} 
instead  of  subsets  of  {i\,  in,  *3}-  We  will  discuss  recognizing  the  product  case  in  more  general  array  references  in 
Part  2  of  this  paper. 


We  now  give  a  number  of  concrete  examples.  We  introduce  a  common  notation  from  linear  programming  that 
we  will  also  use  later.  We  start  with  the  version  of  the  hypothesis  (6.2)  used  in  the  proof,  1  <  sj5(j,i),  and 

rewrite  this  as  lj  <  sT  ■  A,  where  Id  £  Rd  is  a  column  vector  of  all  ones,  s  =  (si, . . . ,  sm)T  £  Rm  is  a  column 
vector,  and  A  €  MmX£i  is  a  matrix  with  A-j,  =  S(j,i).  Recall  that  the  communication  lower  bound  in  Section  4 
depends  on  sj  =  1  m  '  s>  an<^  that  the  lower  bound  is  tightest  (maximal)  when  lfn  ■  s  is  minimized.  This  leads 

to  the  linear  program 

minimize  ljn  ■  s  subject  to  lj  <  sT  ■  A,  (6-3) 

where  we  refer  to  the  optimal  value  of  lfn  •  s  as  shbl-  Note  that  (6.3)  is  not  necessarily  feasible,  in  which  case 
shbl  does  not  exist.  This  is  good  news,  as  the  matrix  powering  example  shows.  Since  A  has  one  row  per  <pj  (i.e., 
per  array  reference) ,  and  one  column  per  loop  index,  we  will  correspondingly  label  the  rows  and  columns  of  the  A 
matrices  below  to  make  it  easy  to  see  how  they  arise.  These  lower  bounds  are  all  attainable  (see  Section  7.2). 


Example:  Matrix- Vector  Multiplication  (Part  2/3).  We  continue  the  example  from  Section  6.2.2,  with 

(j)y(h,in)  =  (*i),  <t>A(h,i2)  =  (h ,h),  and  (j>x(ii,in)  =  (h),  or 


H 

V  (  1 
A=  A  1 
x  V  0 


*2 


0 

1 

1 


and  the  linear  program  (6.3)  becomes  minimizing  Shbl  =  sy  +  sa  +  sx  subject  to  sy  +  sa  >  1  and  sa  +  sx  >  1.  The 
solution  is  sx  =  sy  =  0  and  sa  =  1  =  shbl-  This  yields  a  communication  lower  bound  of  f 1(7V2/MShbl_1)  =  fl(N2). 
This  reproduces  the  well-known  result  that  there  is  no  opportunity  to  minimize  communication  in  matrix-vector 
multiplication  (or  other  so-called  BLAS-2  operations)  because  each  entry  of  A  is  used  only  once.  As  we  will  see  in 
Section  7.1.2,  this  trivial  lower  bound  (no  data  reuse)  is  easily  attainable.  A 

Example:  A  - Body  Simulation  and  Database  Join  (Part  1/3).  The  (simplest)  code  that  accumulates  the 
force  on  each  particle  (body)  due  to  all  N  particles  is 


for  ii  =  1  :  N ,  for  in  =  1  :  N, 

F(ii)  +=  compute_force(P(ii),  P(i 2)) 

We  get  (j)F(i\,in)  =  (A),  <t>p1(ii,  in)  =  (k),  and  4>p2(ii,i2)  =  (in),  or 


*1 

*2 

F  , 

(l 

0\ 

> 

II 

1 

0 

P2  ' 

Vo 

1/ 

Now  the  solution  of  the  linear  program  is  not  unique:  sF  =  a,  spx  =  1  —  a  and  sp2  =  1  is  a  solution  for  any 
0  <  a  <  1,  all  yielding  shbl  =  2  and  a  communication  lower  bound  of  Q(N2 /M). 
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Consider  also  the  similar  program 


for  ii  =  1  :  N\,  for  i2  =  1  :  N2, 
if  predicate(i?(ii),  5(12))  =  true, 

output(ii,  *2)  =  func(i?(*i),  S(i2)) 


which  represents  a  generic  “nested  loop”  database  join  algorithm  of  the  sets  of  tuples  R  and  S  with  |i?|  =  Ni  and 
\S\  =  N%.  We  have  a  data-dependent  branch  in  the  inner  loop,  so  we  split  the  iteration  space  Z  —■  ZtrUe  +  -Zfaise 
depending  on  the  value  of  the  predicate  (we  still  assume  that  the  predicate  is  evaluated  for  every  (ii,  *2))-  This 
gives 


H  *2 

R  /  1  0 

Atrue=  S  0  1 

output  \  1  1 


A  false  — 


R  1  0 


leading  to  shbl, true  =  1  and  Shbl, false  =  2,  respectively.  Writing  |Ztrue|  =  cnjZj  =  aN\N2,  then  we  obtain  lower 
bounds  Cl(aN\N2)  and  f2((l  — a/A/A^/M)  for  iterations  Ztrue  and  Zfa\se,  respectively.  So  we  can  take  the  maximum 
of  these  two  lower  bounds  to  obtain  a  lower  bound  for  the  whole  program.  Altogether,  this  yields 


max  (f2(aiVi IV2),  f2((l  —  a/AiA^/M))  =  Cl(max(aNiN2,  N\N2/M)), 


an  increasing  function  of  a  G  [0, 1]  (using  the  fact  that  (M  +  1  )/M  =  0(1)).  When  a  is  close  enough  to  zero,  we 
have  the  ‘best’  lower  bound  A(N-[  N2/M),  and  when  a  is  close  enough  to  1,  we  have  0(oW-|  N2).  the  size  of  the 
output,  a  lower  bound  for  any  computation.  A 

Example:  Matrix  Multiplication  (Part  3/5).  We  outlined  this  result  already  in  part  2/5  of  this  example  (see 
Section  1).  We  have  4/4 (*  1A2A3)  =  (k,  h),  4>b(h,  h ,h)  =  (*31*2),  and  =  (A  A 2),  or 


*1  «2  h 

All  0  1  \ 

A  =  B  0  1  1  . 

C  \  1  1  0  / 

One  may  confirm  that  the  corresponding  linear  program  has  solution  =  Sb  =  sq  =  1/2  so  Shbl  =  3/2,  yielding 
the  well  known  communication  lower  bound  of  fl(N3 /M1/2).  Recall  from  Section  2  that  the  problem  of  matrix 
multiplication  was  previously  analyzed  in  [13]  using  the  Loomis- Whitney  inequality  (Theorem  2.1,  a  special  case  of 
Theorem  3.2)  and  extended  to  many  other  linear  algebra  algorithms  in  [4].  A 

Example:  Tensor  Contraction  (Part  1/2).  Assuming  l<j<k  —  l<d,  the  code  is 

for  ii  =  l:  iV,  . . . ,  for  i(j  =  1  :  TV, 

C{l\,  .  .  .  ,  ij ,  A ,  .  .  .  ,%d)  “t~ —  A(?4,  .  .  .  Ak—l)  '  -^(L?  + 1 1  ‘  i^d) 

We  get 

i\  •  •  •  ij  ij-\- 1  •  •  •  ik— 1  *  *  *  id 

At  1  •••  1  1  •••  1  0  •••  0\ 

A=  B  0  •••  0  1  •••  1  1  •••  1  . 

C\1  •••  1  0  •••  0  1  •••  1/ 

The  linear  program  is  identical  to  the  linear  program  for  matrix  multiplication,  so  shbl  =  3/2,  and  communication 
lower  bound  is  f l(Nd/M1l2).  A 

Example:  Complicated  Code  (Part  2/4).  We  already  outlined  this  example  in  part  1/4  (see  Section  1).  The 
code  given  in  part  1/4  leads  to 

*i  *2  *3  *4  *5  *6 

At  /  1  0  1  0  0  1  \ 

a2  110  10  0 

A  A3)1  0  110  10 

A4,A3i2  0  0  1  1  0  1’ 

a5  0  1  0  0  0  1 

a6  \  1  0  0  1  1  0  / 
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since  A3  is  accessed  twice,  once  with  the  same  subscript  as  A4,  so  we  have  ignored  the  duplicate  subscript  following 
Remark  6.2.  We  obtain  s  =  (2/7, 1/7,  3/7,  2/7,  3/7, 4/7)T  as  a  solution,  giving  shbl  =  15/7,  so  the  communication 
lower  bound  is  Ct(N6 /M8/7).  A 


Example:  Matrix  Powering  (Part  3/4).  We  continue  this  example  from  part  2/4  (above),  with  <f>A,  (f>B  and 
4>c  as  given  there.  So  we  have 

*1  *2  *3  *4 

A  ( 0  1  0  1  \ 

A  =  B  0  0  1  1 

c  \  0  1  1  0 ) 

The  first  column  of  A  gives  us  the  infeasible  constraint  0  •  +  0  •  Sb  +  0  •  sq  >  1.  As  mentioned  above,  we  can 

reorganize  the  loops  so  that  we  can  increase  k  arbitrarily  without  communication.  The  issue  is  that  the  intermediate 
powers  of  A  do  not  occupy  unique  memory  addresses,  but  rather  overwrite  the  previous  power  (stored  in  the  buffers 
B  and  C).  Following  the  strategy  of  imposing  Reads  and  Writes,  we  apply  Theorem  6.6  to  the  second  code  fragment 
in  part  1/4  of  this  example  (see  Section  2),  yielding 


*  1  *2  *3  *4 

Ai  /  1  1  1  0  \ 

A  =  A2  0  1  0  1 

a3  V 1  0  1  1  j 

where  the  subscripts  on  A  distinguish  its  three  appearances.  The  resulting  linear  program  is  feasible  with  si  = 
S2  —  S3  =  1/2  and  shbl  =3/2,  yielding  a  communication  lower  bound  of  ckN3 /M1'2  for  some  constant  c  >  0.  As 
mentioned  in  part  1/4,  the  number  of  imposed  Reads  and  Writes  is  at  most  2  (k  —  2)  IV2,  so  subtracting  we  get  a 
communication  lower  bound  for  the  original  algorithm  of  ckN3 /M1/2  —  2 (k  —  2 )N2.  When  N  =  ^(M1/2),  the  first 
dominates,  and  we  see  the  complexity  is  the  same  as  k  independent  matrix  multiplications.  When  N  =  0(M1/2), 
the  second  term  may  dominate,  leading  to  a  zero  lower  bound,  which  reflects  the  fact  that  one  can  read  the  entire 
matrix  A  into  fast  memory,  perform  the  entire  algorithm  while  computing  and  discarding  intermediate  powers  of 
A ,  and  only  write  out  the  final  result  Ak .  In  other  words,  any  communication  lower  bound  that  is  proportional  to 
the  number  of  loop  iterations  \Z\  =  {k  —  1)N3  must  be  zero.  A 

7  Communication-optimal  algorithms 

Our  ultimate  goal  is  to  take  a  simple  expression  of  an  algorithm,  say  by  nested  loops,  and  a  target  architecture, 
and  automatically  generate  a  communication  optimal  version,  i.e. ,  that  attains  the  lower  bound  of  Theorem  4.1, 
assuming  the  dependencies  between  loop  iterations  permit.  In  this  section  we  present  some  special  cases  where  this 
is  possible,  corresponding  to  the  cases  in  Section  6  where  we  could  effectively  compute  the  linear  constraints  (3.1). 

The  rest  of  this  section  is  organized  as  follows.  Section  7.1  briefly  addresses  the  infeasible  and  injective  cases 
described  in  Section  6.2.  Section  7.2  presents  communication-optimal  sequential  algorithms  for  the  product  case 
whose  lower  bound  was  presented  in  Section  6.3.  Section  7.3  presents  communication-optimal  parallel  algorithms 
for  the  same  product  case,  including  generalizations  of  recent  algorithms  like  2.5D  matrix  multiplication  [20],  which 
use  additional  memory  (larger  M)  to  replicate  data  and  reduce  communication.  In  Part  2  of  this  paper  (outlined 
in  Section  8)  we  will  discuss  attainability  in  general. 

7.1  Trivial  Cases 

7.1.1  Infeasible  Case 

As  shown  in  Lemma  6.4,  if  H  =  fljli  ker((£j)  is  nontrivial,  then  shbl  doesn’t  exist.  That  is,  since  H  is  infinite,  we 
can  potentially  perform  an  unbounded  number  of  iterations  I  €  H  with  the  same  m  operands  (which  fit  in  cache, 
since  in  =  o(M)  by  assumption  —  see  Section  4). 

Recall  the  matrix  powering  example  from  the  previous  section  (parts  2/4  and  3/4).  In  part  2/4,  we  saw  a 
lower  bound  of  0  words  moved  was  possible,  provided  the  matrix  dimension  N  was  bounded  in  terms  of  M1/2.  We 
then  showed  in  part  3/4  how  to  obtain  a  lower  bound  on  the  modified  code  with  imposed  Reads  and  Writes  (from 
part  1/4,  in  Section  4)  which  is  sometimes  tighter,  depending  on  the  relative  sizes  of  N  and  M.  In  part  4/4  below, 
we  show  that  the  tighter  bound  is  also  attainable,  by  reorganizing  the  modified  code. 
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7.1.2  Injective  Case 


As  shown  in  Lemma  6.5,  if  there  is  an  injective  <pj,  we  have  shbl  =  1.  Thus  no  data  reuse  is  possible  asymptotically. 
So,  executing  the  loop  nest  in  any  order  (of  the  iteration  space)  will  attain  the  lower  bound.  We  revisit  the  matrix- 
vector  multiplication  example  from  the  previous  section  (parts  1/3  and  2/3)  in  part  3/3  below. 


7.2  Product  Case  —  Sequential  Algorithms 

Here  we  show  how  to  find  an  optimal  sequential  algorithm  in  the  special  case  discussed  in  Theorem  6.6,  where 
the  subscripts  4>j  all  choose  subsets  Sj  of  the  loop  indices  i\,...,id-  Recalling  the  notation  from  Section  6.3,  (6.3) 
provided  a  linear  program  for  s  =  (si, . . . ,  sm)T  (that  we  call  “sLP”  here), 

(sLP)  minimize  ljn  ■  s  subject  to  lj  <  sT  •  A, 

where  A  £  Rmxd  was  a  ma^rjx  with  A jt  =  1  if  i  g  Sj  and  0  otherwise.  We  let  shbl  denote  the  minimum  value  of 
1T  •  s 

±m  ** 

For  notational  simplicity,  we  consider  a  computer  program  of  the  form 


for  i-\  =  l  :  N.  for  i2  =  1  :  A,  . . . ,  for  id  =  1  :  N, 
inner_loop(?'i,  *2, . . .  ,id) 


and  ask  how  to  reorganize  it  to  attain  the  communication  lower  bound  of  f 2(iVd/MSHBL_1)  in  a  sequential  imple¬ 
mentation.  As  explained  in  Section  4,  explicit  loop  nests  are  not  necessary,  but  it  is  easier  to  explain  the  result. 
We  consider  a  reorganization  of  the  following  “blocked”  or  “tiled”  form: 

for  j-\  =  1  :  MXl  :  N,  for  j2  =  1  :  Mx‘2  :  N,  . . . ,  for  jd  =  1  :  MXd  :  N, 

for  fci  =  0  :  MX1  -  1,  for  k2  =  0  :  MX2  -  1,  ...,  for  kd  =  0  :  MXd  -  1, 

(*1,*2,  ■  ■  ■  ,id)  =  (ji,j2,  •  ■  • ,  jd)  +  (ki,k2, ...  ,kd ) 
inner _loop(ii, i2, . . .  ,id) 


Here  M  is  the  fast  memory  size,  and  xi, . . . ,  Xd  are  parameters  to  be  determined  below. 

Here  we  show  that  simply  solving  the  dual  linear  program  of  (sLP)  provides  the  values  of  x\ , . . .  ,Xd  that  we 
need  for  the  tiled  code  to  be  optimal,  i.e. ,  to  attain  the  communication  lower  bound: 

Theorem  7.1.  Suppose  that  the  linear  program  (sLP)  for  s  is  feasible.  Then  the  dual  linear  program  “xLP”  for 
x  =  (xi, . .  .,xd)T 

( xLP )  maximize  lj  •  x  subject  to  lm  >  A  •  x, 

is  also  feasible,  and  using  these  values  of  x  in  the  tiled  code  lets  it  attain  the  communication  lower  bound  of 
n(Ad/MSHBL^1)  words  moved. 

Proof.  By  duality  the  solution  x  of  (xLP)  satisfies  lj  ■  x  =  1  m  ■  s  =  Shbl-  Let  E  =  {(?'i, . . . , id)}  C  Zd  be  the  set 
of  lattice  points  traversed  by  the  d  innermost  loops  of  the  tiled  code,  i.e.,  the  loops  over  k\, . . . ,  kd,  for  one  value 
of  (ji,  ...,jd).  It  is  easy  to  see  that  the  size  of  E  is  \E\  =  Tlf=i  MXi  =  =  MSHBL.  Furthermore,  the 

constraint  lm  >  A  •  x  means  that  for  each  j,  1  >  A jiXi  =  ^2ieS  xi-  with  equality  for  at  least  one  j.  Thus 

\<f>j{E)\  =  Iliss  x '  <  M 1  =  M.  In  other  words,  the  number  of  words  accessed  by  the  jth  array  in 

the  innermost  d  loops  is  at  most  M ,  with  eciuality  at  least  once.  Thus  the  inner  d  loops  of  the  algorithm  perform 
the  maximum  possible  MSHBL  iterations  while  accessing  @(M)  words  of  data,  so  the  algorithm  is  optimal.  □ 

We  note  that  in  practice,  we  may  want  to  choose  the  block  sizes  MXi  to  be  a  constant  factor  smaller,  in  order  for 
all  the  data  accessed  in  the  innermost  d  loops  to  fit  in  fast  memory  memory  simultaneously.  We  will  look  at  these 
constants  in  Part  2  of  this  work. 

We  look  again  at  the  examples  from  Section  6.3,  using  the  notation  introduced  there. 

Example:  Matrix- Vector  Multiplication  (Part  3/3).  We  write  x  =  (x  \ ,  x2)T  as  the  unknown  in  the  dual 
linear  program  (xLP),  so  that  MXl  is  the  tile  size  for  variable  i\  and  MX2  is  the  tile  size  for  variable  i2.  Then 
(xLP)  becomes  maximizing  shbl  =  X\  +x2  subject  to  Xi  <  1,  xi  +X2  <  1,  and  aq  <  1.  The  solution  is  xi  =  a  and 
x2  =  1  —  a  for  any  0  <  a  <  1,  and  Shbl  =  1.  In  other  words,  we  can  “tile”  the  matrix  A  into  any  MQ-by-M1_a 
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rectangular  blocks  of  total  size  M.  Indeed,  such  flexibility  is  common  when  one  of  the  subscripts  is  injective  (see 
Section  7.1.2,  above). 

We  can  use  our  approach  to  further  reduce  communication,  potentially  up  to  a  factor  of  2,  by  choosing  the 
optimal  a.  We  simply  ignore  the  array  A ,  and  apply  the  theory  just  to  the  vectors  x  and  y,  leading  to  the  same 
analysis  as  the  .ZV-body  problem,  which  tells  use  to  use  blocks  of  size  M  for  both  x  and  y  (see  next  example).  We 
still  read  each  entry  of  A  once  (moving  N 2  words),  but  reuse  each  entry  of  x  and  y  M  times.  A 

Example:  iV-Body  Simulation  and  Database  Join  (Part  2/3).  We  write  x  =  (x-\ ,  X‘>)r  as  the  unknown  in 
the  dual  linear  program  (xLP),  so  that  MXl  is  the  tile  size  for  variable  i\  and  MX2  is  the  tile  size  for  variable  A  ■ 
Then  (xLP)  becomes  maximizing  shbl  =  X\  +  X2  subject  to  £i  <  1  and  £2  <  1.  The  solution  is  X\  =  £2  =  1  and 
shbl  =  2,  which  tells  us  to  to  take  M  particles  indexed  as  P(ii),  M  (usually  different)  particles  indexed  as  P(i 2), 
and  compute  all  M2  pairs  of  interactions. 

For  the  database  join  example,  we  use  the  optimal  blocking  for  the  iterations  Afaise,  which  also  yields  £1  =  £2  =  1, 
and  so  compute  predicate(i?(A),  S/A))  M'2  times  using  M  values  each  of  Ril  and  Sl2 .  Whenever  the  predicate  is 
true,  output (ii,  *2)  is  immediately  written  to  slow  memory.  Thus  there  are  Ni  N% j M  reads  of  R(i\)  and  S/A),  and 
|2true|  =  otN\N2  writes  of  output  (A,  *2)1  which  attains  the  lower  bound.  A 

Example:  Matrix  Multiplication  (Part  4/5).  We  write  x  =  (x  \ ,  £2,  £.3)T  as  the  unknown  in  the  dual  linear 
program  (xLP),  so  that  MXl  is  the  tile  size  for  variable  A,  MX2  is  the  tile  size  for  variable  A,  and  MX3  is  the  tile 
size  for  variable  A-  Then  (xLP)  becomes  maximizing  Shbl  =  £1  +  £2  +  £3  subject  to  £1  +  £3  <  1,  £1  +  £2  <  1 
and  £2  +  £3  <  1.  The  solution  is  £1  =  £2  =  £3  =  1/2  and  shbl  =  3/2,  which  tells  us  the  well-known  result  that 
each  matrix  A ,  B ,  and  C  should  be  tiled  into  square  blocks  of  size  M1/2-by -Af1/2.  When  M  is  the  cache  size 
on  a  sequential  machine,  this  is  a  well-known  algorithm,  with  a  communication  cost  of  0(N3 /Af1/2)  words  moved 
between  cache  and  main  (slower)  memory.  A 


Example:  Tensor  Contraction  (Part  2/2).  We  write  x  =  (x\, . . . ,  Xd)T  as  the  unknown  in  the  dual  linear 
program  (xLP),  so  that  MXr  is  the  tile  size  for  variable  ir.  Then  (xLP)  becomes  maximizing  Shbl  =  X/,-=i  A- 
subject  to 

k— 1  d  j  d 

xr  <  1,  xr  <  1,  and  xr  +  xr  <  1. 

r—  1  r=j+l  r—  1  r—k 

The  solution  is  any  set  of  values  of  0  <  xr  <  1  satisfying 


j  k- 1 

5>r  =  i/2,  xr 

r—l  r—j+1 


1/2, 


d 

and  xr  =  1/2, 

r—k 


so  shbl  =  3/2.  For  example,  one  could  use 


£i  ~  ~  xj  —  2  •  ’  xj+ 1  ~  ' ' 


—  %k— 1  — 


2{k-j-iy 


and  Xk  =  •  •  •  =  Xd  = 


1 


2{d-k  +  l) 


which  is  a  natural  generalization  of  matrix  multiplication.  A 

Example:  Complicated  Code  (Part  3/4).  We  write  x  =  (x\, . . .  ,Xq)t  as  the  unknown  in  the  dual  linear 
program  (xLP),  so  that  AIXr  is  the  tile  size  for  variable  ir.  The  solution  is 

£  =  (2/7, 3/7, 1/7, 2/7, 3/7, 4/7)T 

(and  shbl  =  15/7),  leading  to  the  block  sizes  M2/7- by - by -M4/7.  A 

Example:  Matrix  Powering  (Part  4/4).  While  (sLP)  is  infeasible,  (xLP)  has  unbounded  solutions.  As  sug¬ 
gested  above,  the  trivial  lower  bound  of  0  words  moved  may  not  always  be  attainable.  After  imposing  Reads 
and  Writes,  we  saw  in  part  3/4  (see  previous  section)  that  when  N  is  sufficiently  large  compared  to  A-/1/2,  the 
communication  lower  bound  (for  the  modified  program)  looks  like  k  independent  matrix  multiplications,  and  when 
N  =  (^(Af1/2),  the  lower  bound  degenerates  to  0.  To  attain  the  lower  bound  for  the  modified  code  with  imposed 
Reads  and  Writes,  we  implement  the  matrix  multiplication  A(i i,  •,  •)  =  A(  1,  •,  •)  •  A(ii  —  1,  •,  •)  in  the  inner  loop  in 
the  optimized  way  as  described  in  part  4/4  of  the  matrix  multiplication  example  (above),  and  leave  the  outer  loop 
(over  ii)  unchanged.  A 
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7.3  Product  Case  —  Parallel  Algorithms 

We  again  consider  the  special  case  discussed  in  Theorem  6.6,  and  show  how  to  attain  a  communication  optimal 
parallel  algorithm.  We  again  start  with  the  unblocked  code  of  the  last  section,  and  show  how  to  map  it  onto  P 
processors  so  that  it  is  both  load  balanced ,  i.e. ,  every  processor  executes  inner _loop(ii,  *2, . . . ,id)  0(Nd/P)  times, 
and  every  processor  attains  the  communication  lower  bound  fl{{Nd/P)/AdSBBB~1),  i.e.,  communicates  at  most 
this  many  words  to  or  from  other  processors.  We  need  some  assumption  about  load  balance,  since  otherwise  one 
processor  could  store  all  the  data  and  do  all  the  work  with  no  communication.  In  contrast  to  the  sequential  case, 
M  is  no  longer  fixed  by  the  hardware  (e.g.,  to  be  the  cache  size),  but  can  vary  from  the  minimum  needed  to  store 
all  the  data  in  the  arrays  accessed  by  the  algorithm  to  larger  values.  Since  the  lower  bound  is  a  decreasing  function 
of  M,  we  should  ideally  have  a  set  of  algorithms  that  communicate  less  as  M  grows  (up  to  some  limit).  There  is 
in  fact  a  literature  on  algorithms  for  linear  algebra  [20,  19,  23,  18]  and  the  iV-body  problem  [8]  that  do  attain  the 
lower  bound  as  M  grows;  here  we  will  see  that  this  is  possible  in  general  (again,  loop  dependencies  permitting). 

In  Section  1,  we  said  that  we  would  only  count  the  number  of  words  moved  between  processors,  not  other 
costs  like  message  latency,  network  congestion,  or  costs  associated  with  noncontiguous  data.  We  will  also  ignore 
dependencies.  This  will  greatly  simplify  our  presentation,  but  leave  significant  work  to  design  more  practical  parallel 
algorithms;  we  leave  this  until  Part  2  of  this  paper. 

Assume  for  a  moment  that  we  have  chosen  M ;  we  distribute  the  work  among  the  processors  as  follows:  Supposing 
that  the  linear  program  (sLP)  is  feasible,  let  x  =  (aq, .  . .  ,Xd)T  be  the  solution  of  the  dual  linear  program  (xLP) 
defined  in  Theorem  7.1.  Just  as  in  the  sequential  case,  we  take  the  lattice  of  Nd  points  representing  inner  loop 
iterations,  indexed  from  I  =  (ij_ ,  *2, . . . ,  id)  =  (0, 0, . . . ,  0)  to  (IV—  1,  IV—  1, . . . ,  IV—  1),  and  partition  the  axis  indexed 
by  ij  into  N/MXj  contiguous  segments  of  equal  size  MXj .  (As  before,  for  simplicity  we  assume  all  fractions  like 
N/Mxi  divide  evenly.)  This  in  turn  divides  the  entire  lattice  into  Ilf=i  N/MXi  =  Nd/AdSBBB  “bricks”  of  MSHBL 
lattice  points  each.  Let  E  refer  to  the  lattice  points  in  one  brick.  Just  as  in  the  proof  of  Theorem  7.1,  the  amount 
of  data  from  array  Aj  needed  to  execute  the  inner  loop  iterations  indexed  by  E  is  bounded  by  \<pj(E)\  <  Ad.  So,  up 
to  a  constant  multiple  in  (which  we  ignore),  each  processor  has  enough  memory  to  store  the  data  needed  to  execute 
a  brick.  This  yields  one  constraint  from  our  approach:  to  be  load  balanced,  there  need  to  be  at  least  P  bricks,  i.e., 
Nd/AdSHBB  >  P1  or  M  <  _/Vd/SHBL /P1/Shbl  .  In  other  words,  we  should  pick  Ad  no  larger  than  this  upper  bound  in 
order  to  use  all  available  processors.  (Of  course  M  also  cannot  exceed  the  memory  available  on  each  processor.) 

The  next  constraint  arises  from  having  enough  memory  in  all  the  processors  to  hold  all  the  data.  Recalling  the 
notation  used  to  define  (sLP),  we  let  S7  be  the  subset  of  loop  indices  appearing  in  < j>j ,  and  we  have  Sj  =  dj  = 
rank (<fij(Zd)).  This  means  that  the  total  storage  required  by  array  Aj  is  Ndj ,  and  the  total  storage  required  by  all 
arrays  is  J2JLi  Ndj ,  yielding  the  lower  bound  Ad  >  J2JLi  Ndi  /P,  and  so  altogether 


Jyd/SHBE 
pi/ shbl-1 


>  M  > 


E"li  Nd 


p 


f^rnax 
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where  dmax  :=  max^1  dj.  Plugging  these  bounds  on  M  into  the  communication  lower  bound  yields 
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The  expression  on  the  left  is  simply  the  memory- independent  lower  bound  fl((|Z|/P)1/SHBL)  discussed  in  Section  4.6. 
Note  that  when  Ad  equals  its  upper  bound,  it  also  equals  the  communication  lower  bound,  which  means  that  the 
content  of  the  memory  only  needs  to  be  communicated  once;  this  is  related  to  the  fact  that  the  assumption 
|Z|/P  =  u>(F)  fails  to  hold  (see  Section  4.6). 

For  example,  for  matrix  multiplication  (d  =  3,  srbl  =  3/2,  and  dmax  =  2)  we  get  the  familiar  constraints 
N2/p2/3  >  M  >  N2/p  and  N2/p2/3  <  ( N3/P)/M 1/2  <  N2 / P1/2  [3,  1,  13,  20], 

For  Ad  in  this  range,  we  may  express  our  parallel  algorithm  most  simply  as  follows: 


while  there  are  unexecuted  bricks 

assign  P  unexecuted  bricks  to  P  processors 

in  parallel,  communicate  the  needed  data  to  each  processor 

in  parallel,  execute  each  brick 

The  algorithm  is  clearly  load  balanced  (and  oblivious  to  any  loop  dependencies).  It  remains  to  compute  the  com¬ 
munication  complexity:  this  is  simply  the  number  of  times  each  processor  executes  a  brick,  times  the  communication 
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required  per  brick.  Since  there  are  Nd /APHBL  bricks  shared  evenly  by  P  processors,  with  0(M )  communication  per 
brick,  the  result  is  0((Ard/P)/MSHBL_1),  i.e.,  the  lower  bound  is  attained.  More  formally,  we  could  use  an  abstract 
model  of  computation  like  LPRAM  [1]  or  BSPRAM  [22,  18]  to  describe  this  algorithm. 

Instead,  by  making  more  assumptions  (satisfied  by  some  of  our  running  examples)  we  may  also  describe  the 
algorithms  in  more  detail  as  to  how  to  map  them  to  an  actual  machine  (we  will  weaken  these  assumptions  in  Part  2) . 
We  will  assume  that  A-x  =  lm,  not  just  A-a:  <  lm  as  required  in  (xLP).  By  choosing  all  x,:  =  l/drnax,  which  satisfies 
A  •  x  <  lm,  we  see  that  srbl  >  lj  ■  x  =  d/dm ax.  We  will  in  fact  assume  Shbl  =  d/dmax.  Finally,  we  will  assume  for 
convenience  that  certain  intermediate  expressions,  like  PSHBL,  are  integers.  We  will  write  M  =  CNdiaa-x /P,  where  C 
is  the  number  of  copies  of  the  data  we  will  use;  (7=1  corresponds  to  the  minimum  memory  necessary.  Combining 
with  the  upper  bound  on  M  from  (7.1),  we  get  7Vd/SHBL/P1/SHBL  >  CAdmax/P,  or  <7  <  P1-1/shbl_ 

Next,  we  divide  the  P  processors  into  C  groups  of  P/C  processors  each.  The  total  number  of  bricks  is 
Tlf=i  N/MXi  =  Ad/AFHBL  =  Ad/((7A<imax/P)SHBL  =  (P/(7)SHBL.  If  we  used  just  P/C  processors  to  process 
all  these  bricks  (P/C  at  a  time  in  parallel),  then  it  would  take  (P/C)SHBL“1  parallel  phases.  But  since  there  are  C 
groups  of  processors,  only  Pshbl-i^shbl  phases  are  needed. 

Now  we  describe  how  the  data  is  initially  laid  out  in  each  group  of  P/C  processors.  Since  we  assume  there 
is  enough  memory  P  •  M  for  C  copies  of  the  data,  each  group  of  P/C  processors  will  have  its  own  copy.  For 
each  array  A^,  1  <  /./  <  m,  we  index  the  processors  by  (jk  €  N  :  0  <  jk  <  N/MXk)kesfli  so  the  number  of 
processors  indexed  is  Ilfces  N/MXk  =  Nd>J  /M^k(=s»Xk  =  Nd'x/M,  since  we  have  assumed  A  •  x  =  lm.  Finally, 
Ad"  /M  <  Nd /M  =  P/C ,  so  we  index  all  the  processors  in  the  group  when  =  dmax.  (When  d M  <  dmax,  i.e., 
the  array  A M  has  fewer  than  the  maximum  number  of  dimensions,  then  it  is  smaller  and  so  needs  fewer  processors 
to  store.)  Since  AM  has  subscripts  given  by  the  indices  in  SM,  we  may  partition  AM  by  blocking  the  subscripts  A , 
for  k  £  Sfl ,  into  blocks  of  size  MXk,  so  that  each  block  contains  IlfceS  APfc  =  M  entries,  and  so  may  be  stored 
on  a  single  processor.  Then  block  ( jk)keS„  is  stored  on  the  processor  with  the  same  indices,  for  0  <  jk  <  N/MXk. 
This  indexing  will  make  it  easy  for  a  processor  to  find  the  data  it  needs,  given  the  brick  it  needs  to  execute. 

Next  we  describe  how  to  group  bricks  (of  the  iteration  space),  indexed  by  J  =  {jk  G  N  :  0  <  jk  <  N/MXk )/=1 , 
into  groups  of  P/C  to  be  done  in  each  phase.  We  define  a  “superbrick”  to  consist  of  (N/MXk)1^SHBB  contiguous 
bricks  in  each  direction,  containing  a  total  of  ]/[^=1(A/M:Cfc)1/SHBL  =  P/C  bricks.  Thus  one  superbrick  can  be 
executed  in  one  parallel  phase  by  one  group  of  P/C  processors,  and  C  superbricks  can  be  executed  in  one  parallel 
phase  by  all  P.  Given  the  indexing  described  in  the  last  paragraph,  the  index  J  of  each  brick  in  a  superbrick 
indicates  exactly  which  processor  in  a  group  owns  the  data  necessary  to  execute  it. 

We  still  need  to  assign  superbricks  to  groups  of  processors.  There  are  many  ways  to  do  this.  We  consider  two 
extreme  cases,  (7=1  and  C  =  P1”1/shbl_ 

When  (7  =  1,  there  is  just  one  group  of  processors,  and  there  are  PSHBL_1  superbricks,  each  containing  P 
bricks,  with  each  brick  assigned  to  a  processor  using  the  index  J .  Altogether  there  are  Pshbl-i  parallel  phases 
to  complete  the  execution.  We  emphasize  that  this  is  just  one  way  of  many  to  accomplish  parallelization;  see  the 
examples  below.  For  example,  for  matrix  multiplication,  this  corresponds  to  the  classical  so-called  ‘2D’  algorithms, 
where  the  P  processors  are  laid  out  in  a  P1/2-by-P1/2  grid. 

When  (7  equals  its  maximum  value  (7  =  P1_1/shbl;  then  there  are  just  C  superbricks  and  the  number  of 
parallel  phases  is  Pshbl-i^shbl  _  \  por  example,  for  matrix  multiplication  this  corresponds  to  a  so-called  ‘3D’ 
algorithm,  where  the  P  processors  are  laid  out  in  a  P1/3-by-P1/3-by-P1/3  grid  (see  [1]).  For  the  TV-body  problem, 
this  corresponds  to  a  ‘2D’  algorithm  (see  [8]). 

Only  two  of  our  running  examples  satisfy  the  assumptions  in  this  special  case.  As  we  mentioned  above,  we  will 
weaken  these  assumptions  in  Part  2  to  address  the  general  ‘product  case,’  as  described  in  Sections  6.3  and  7.2. 

Example:  Ar-Body  Simulation  and  Database  Join  (Part  3/3).  Our  assumptions  in  the  preceding  paragraphs 
apply  to  the  A-body  example,  with  x\  =  Xi  =  1  and  shbl  =  2  =  d/dmax.  (We  will  not  discuss  the  database  join 
example  here  because  it  does  not  satisfy  the  assumptions  above.)  A 

Example:  Matrix  Multiplication  (Part  5/5).  Here  our  assumptions  in  the  preceding  paragraphs  apply,  with 
xi  =  x2  =  x3  =  1/2  and  sHbl  =  3/2  =  d/dmax.  A 

Example:  Complicated  Code  (Part  4/4).  This  example  does  not  satisfy  the  assumptions  of  the  preceding 
paragraphs,  so  we  will  instead  consider  the  similar-looking  code  sample 

for  i\  =  1  :  N,  for  =  1  :  N,  for  *3  =  1  :  N,  for  *4  =  1:  A,  for  A  =  1  :  A,  for  A  =  1  :  A, 

Ai(A,  *3,  A)  +=  funci(A2(A,  A,  A),  A3(i2,  A,  A),  A*  (A,  A,  A)) 

A5(i2,  A,  A)  +=  func2(A6(A,  A,  A),  ^(A,  A,  A)) 
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where  we  have  added  the  additional  subscript  i$  to  the  access  of  A5,  leading  to 


and  the  solution  x\ 
d/dmax  =  6/3. 
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a>6  =  1/3,  so  now  our  previous  assumptions  are  satisfied:  A  •  x  =  1,  and  shbl  =  2  = 

A 


8  Conclusions 

Our  goal  has  been  to  extend  the  theory  of  communication  lower  bounds,  and  to  design  algorithms  that  attain  them. 
Motivated  by  a  geometric  approach  first  used  for  the  3  nested  loops  common  in  linear  algebra,  we  extended  these 
bounds  to  a  very  general  class  of  algorithms,  assuming  only  that  that  the  data  accessed  can  be  modeled  as  arrays 
with  subscripts  that  are  linear  functions  of  loop  indices.  This  lets  us  use  a  recent  result  in  functional  analysis 
to  bound  the  maximum  amount  of  data  reuse  ( F ),  which  leads  to  a  communication  lower  bound.  The  bound  is 
expressed  in  terms  of  a  linear  program  that  can  be  written  down  and  solved  clecidably.  For  many  programs  of 
practical  interest,  it  it  possible  to  design  a  communication-optimal  algorithm  that  attains  the  bound.  This  is  true, 
for  example,  when  all  the  array  subscripts  are  just  subsets  of  the  loop  indices.  This  includes,  among  other  examples, 
linear  algebra,  direct  A-body  interactions,  database  joins,  and  tensor  contractions. 

Part  2  of  this  paper  will  both  expand  the  classes  of  algorithms  discussed  in  Section  6  for  which  the  lower  bound 
can  be  easily  computed,  and  expand  the  discussion  of  Section  7  on  the  attainability  of  our  lower  bounds. 

Part  2  will  also  improve  the  algorithm  in  Section  5.1  to  improve  its  efficiency.  First,  we  will  show  that  it 
suffices  to  check  condition  (3.1)  for  a  subset  C  of  subgroups  H  of  Zd,  rather  than  all  subgroups.  In  this  discrete 
extension  of  [25,  Theorem  1.8],  previously  mentioned  in  Remark  5.2,  C  is  the  lattice  generated  by  the  kernels  of 
the  homomorphisms,  ker((/i), . . . ,  ker We  will  discuss  a  few  practical  cases  where  C  is  finite  and  the  approach 
simplifies  further:  this  includes  the  case  when  there  are  at  most  m  =  3  arrays  in  the  inner  loop,  as  well  as  a 
generalization  of  the  product  case  from  Section  6.3. 

Then,  we  discuss  a  few  other  cases  where  C  may  be  infinite,  but  we  know  in  advance  that  we  need  only  check  a 
certain  finite  subset  of  subgroups  H  from  C.  This  includes  the  case  when  all  the  arrays  are  1-  or  (d—  l)-dimensional, 
and  when  there  are  at  most  d  =  4  loops  in  the  loop  nest. 

We  then  return  to  the  question  of  attainability,  i.e.,  finding  an  optimal  algorithm.  We  discuss  two  inequalities, 
(3.2)  and  \E\  <  MSHBL,  in  which  equality  must  be  (asymptotically)  attainable  to  find  an  optimal  algorithm.  For  the 
former  inequality,  we  show  that  asymptotic  equality  is  always  attainable  by  a  family  of  cubes  in  a  critical  subgroup. 
For  the  latter  inequality,  we  give  a  special  case  where  asymptotic  equality  also  holds,  while  leaving  the  general  case 
open.  These  results  suggest  how  to  reorganize  code  to  obtain  an  optimal  algorithm;  we  discuss  practical  concerns, 
including  constants  hidden  in  the  asymptotic  bounds.  We  conclude  by  discussing  how  our  lower  bounds  depend  on 
the  presence  of  (inter-iteration)  data  dependencies. 
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